It wasn’t json, but it probably could’ve been.
Or is there no practical difference?
We can return just the most recently updated documents by maintaining a lastUpdated timestamp on client and server docs. But this would be at the document level, so whole documents get resent; sending only the fields that have changed would require tracking timestamps per field.
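A minimal Node.js sketch of the document-level scheme, with hypothetical document shapes (id, lastUpdated, body are assumptions for illustration):

// Server side: return only documents updated since the client's last sync.
function changedSince(docs, clientLastSync) {
  return docs.filter((doc) => doc.lastUpdated > clientLastSync);
}

const docs = [
  { id: 'a', lastUpdated: 100, body: 'unchanged' },
  { id: 'b', lastUpdated: 250, body: 'edited recently' },
];
console.log(changedSince(docs, 200)); // only doc 'b' goes over the wire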
https://github.com/pksunkara/nosqlite
I could add a section
A reason you don't see much in the "tree of simple files as my DB" space is that the whole point of it is to be trivially simple to understand, and thus, to implement.
Now, with file DBs, it's actually quite important to use atomic writes, so that a crash or concurrent operations don't produce errors. For example, the package posted above can corrupt data if there are concurrent writers to the same data file, because all the writes are non-atomic fs.writeFileSync() calls. What this database should do instead is write to a tempfile, and then rename the tempfile to replace the destination. That way, you get simple last-write-wins semantics with no possibility of creating invalid JSON files.
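Something like this minimal Node.js sketch (writeJsonAtomic is a hypothetical name, not from the package above):

const fs = require('fs');
const path = require('path');
const crypto = require('crypto');

function writeJsonAtomic(file, data) {
  // Write to a tempfile in the same directory, so the rename below
  // stays on one filesystem and is therefore atomic on POSIX.
  const tmp = path.join(
    path.dirname(file),
    `.${path.basename(file)}.${crypto.randomBytes(6).toString('hex')}.tmp`
  );
  fs.writeFileSync(tmp, JSON.stringify(data, null, 2));
  // Atomically replace the destination: readers see either the old
  // file or the new one, never a half-written JSON.
  fs.renameSync(tmp, file);
}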
And in fact, there is a sqldiff tool for diffing SQLite databases: https://sqlite.org/sqldiff.html
But, it would still be a bummer to store them in git because git won't efficiently track changes between sqlite databases, it'll just add the whole binary blob on every commit. oof.
I have vague recollections of a tool that was similarly supposed to make JSON files diffable. IIRC it procedurally built up the object, so {"foo": {"bar": "baz", "quux": [1, 2, 3]}} would be:
foo = {}
foo.bar = "baz"
foo.quux = []
foo.quux[0] = 1
...
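That transformation is easy to sketch in Node.js; this is just an illustration of the idea, not the tool being half-remembered above:

function flatten(value, prefix, out = []) {
  if (Array.isArray(value)) {
    out.push(`${prefix} = []`);
    value.forEach((item, i) => flatten(item, `${prefix}[${i}]`, out));
  } else if (value !== null && typeof value === 'object') {
    out.push(`${prefix} = {}`);
    for (const [key, child] of Object.entries(value)) {
      flatten(child, `${prefix}.${key}`, out);
    }
  } else {
    out.push(`${prefix} = ${JSON.stringify(value)}`);
  }
  return out;
}

// Each path gets its own line, so line-oriented diffs stay small.
console.log(flatten({ bar: 'baz', quux: [1, 2, 3] }, 'foo').join('\n'));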
sqlite3 yourdb.sqlite .dump
If you really want to track the binary e.g. inside git, but still see textual diffs, you would put this into your .gitconfig:

[diff "sqlite3"]
    textconv = sqlite3 $1 .dump
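One detail worth noting (from git's own docs, not specific to this setup): a custom diff driver only applies to paths mapped to it, so you'd also need a line like this in .gitattributes, assuming your database files end in .sqlite:

*.sqlite diff=sqlite3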
I have a demo of it running against my blog's database here: https://github.com/simonw/simonwillisonblog-backup
For my use case, speed was not the most important thing; instead it was using common tools on the DB, being human-readable JSON, and being git diff-able. Also, I wanted small resource usage and a small library size.
$ augtool -r . --noautoload --transform "Json.lns incl /sirdb-example.json" <<EOF
> print /files/sirdb-example.json/
> EOF
/files/sirdb-example.json
/files/sirdb-example.json/dict
/files/sirdb-example.json/dict/entry[1] = "_owner"
/files/sirdb-example.json/dict/entry[1]/string = "nouser"
/files/sirdb-example.json/dict/entry[2] = "username"
/files/sirdb-example.json/dict/entry[2]/string = "nouser"
/files/sirdb-example.json/dict/entry[3] = "email"
.... etc ...
[edit] just did some research and it looks like Git will store the delta of a binary file in its packfiles, just like a text file. The question is just how delta-able sqlite binary files are.
Our internal database is deeply hierarchical and also has a concept of a "folder" record, which users can create and nest. Users have complained for years that there was no change tracking. I surveyed some of the tools mentioned here (and other commercial ones), but they all just flattened the database into JSON, XML, or raw SQL files, which made it very difficult for users to see the changes.
We also thought about getting rid of the database and running off of json files, but that would have required completely rewriting our already brittle internal ORM, which uses raw SQL in too many places for performance, and would have been a breaking change. I have a heavy dislike for rewriting or throwing away whole modules, and also dislike breaking changes, having been personally bitten by them many times in the past. And due to other constraints on my team this was unacceptable. It's still on the table, just not for now.
Instead I wrote a layer which translates back and forth between the ORM (or the raw database) and the json files in the file system. This is an addon to the application and is written in such a way that it's mostly transparent to the users. Each change to a record in the database is translated into the correct json file on disk. Users can commit and push at their leisure. Users can also switch branches / reset to an earlier version, and request that the database be updated from the json files.
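The core of the database-to-disk direction is simple to sketch in Node.js (the names here, like mirrorRecordToDisk and folderPath, are illustrative, not our real code):

const fs = require('fs');
const path = require('path');

// After-save hook: mirror one record into a JSON file whose path
// reflects the user-visible folder hierarchy, so git diffs stay local.
function mirrorRecordToDisk(repoRoot, record) {
  const dir = path.join(repoRoot, ...record.folderPath);
  fs.mkdirSync(dir, { recursive: true });
  const file = path.join(dir, `${record.id}.json`);
  fs.writeFileSync(file, JSON.stringify(record, null, 2) + '\n');
}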
I had thought many times about open sourcing the basic part of this addon. But I was never sure how common this use-case is. Also, it's actually pretty trivial to write the base part. The hard parts were gluing it to the existing application/ORM, and working around the many edge cases and bugs in our existing application.
So, I wonder, how common is this use case (version control of an existing hierarchical SQL database in a user friendly way) in other organizations?
Although you already got my upvote for the use of yer... (I use it all the time and rarely see anyone using it, yagetme).
Like the OP, it stores data locally in .json files and uses a simple MongoDB-inspired API
What's up with requiring "--noautoload"? What does "-r ." do? And the transform invocation starts with the seemingly magical "Json.lns" (where does it come from?) and appears to refer to a file at the root of the filesystem hierarchy?
I'd probably use augtool more often myself, but its UI is seriously inscrutable.
You can get this behavior with a SQL database like MySQL that has a text dump format. I've used this quite effectively to keep relatively small (but important) databases in a git repo.
I can definitely appreciate the advantages of an export format that is explicitly designed to support this (e.g. outputting each field on its own line, perhaps using an explicit label in front of it) but even the out-of-the-box format can often be quite effective.
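For example, with MySQL's mysqldump, a couple of flags make the out-of-the-box dump noticeably more diff-friendly (mydb is a placeholder database name):

# one INSERT per row, and no "Dump completed on ..." timestamp polluting every diff
mysqldump --skip-extended-insert --skip-dump-date mydb > mydb.sql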
With SSDs at 2 GB/s and over, that's on par with early server RAM, and there were companies who were able to serve pretty significant loads with 2 GB/s of RAM.
I'm writing a little program to make this whole thing a lot easier.
Answers: by default, augtool loads all the lenses and scans all the default config files (mostly under /etc); `--noautoload` prevents that.
The -r is `--root`; it specifies the root as the current directory rather than /. And Json.lns is the generic JSON lens that ships with Augeas; the --transform option maps that lens onto the given file, which is why the file then shows up under /files/.