Really cool although quite an undertaking to build an entire SQL engine! I have been working on something pretty similar but using Postgres instead. Basically extending Postgres to run stateless on top of FoundationDB, which would achieve the same thing but with all the Postgres features one would expect (and without some quirks you might not want, like vacuuming).
Working with FoundationDB is a real pleasure as many people have noted already, a very intuitive abstraction to build databases on top of.
How do you plan to solve the N+1 query issue and the five second timeout?
The five second timeout remains and trickles through to apply to Postgres transactions instead. This project would very much be for OLTP, so not really fit for using Postgres for OLAP or hybrid workloads.
The N+1 issue is a really interesting one which I have a plan for but haven't implemented yet. FoundationDB has what they call mapped ranges [0] to help with this, which work in some cases. More generally, one should make sure to issue read requests as soon as possible and not on demand, given that FoundationDB clients have a futures-based design. This is slightly tricky in Postgres because internally it has a pull-based model where one tuple is pulled from the execution plan at a time, so one needs to implement pre-fetching rather than make a read against FDB each time a new tuple is pulled.
[0] https://github.com/apple/foundationdb/wiki/Everything-about-...
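To make the prefetching point concrete, here is roughly the shape of it as a sketch against the FoundationDB Python bindings (pgfdb itself would do this in C inside Postgres; the key layout and names here are made up for illustration):

    import fdb
    import fdb.tuple

    fdb.api_version(630)
    db = fdb.open()

    @fdb.transactional
    def fetch_orders_naive(tr, order_ids):
        # One blocking round-trip per key: classic N+1 behaviour.
        return [tr[fdb.tuple.pack(('orders', oid))].wait() for oid in order_ids]

    @fdb.transactional
    def fetch_orders_prefetched(tr, order_ids):
        # Issue every read immediately; each tr[...] returns a future.
        futures = [tr[fdb.tuple.pack(('orders', oid))] for oid in order_ids]
        # Only now block, so the round-trips overlap instead of serialising.
        return [f.wait() for f in futures]

The Postgres-side difficulty is deciding which keys to ask for before the executor has actually pulled the tuples that need them.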
Right, you'll need to be careful to adjust the read ranges manually to skip pre-fetched but unread tuples. Other aspects that create complexity include SQL's lock based semantics, and that FDB only offers strictly serializable transactions with optimistic concurrency control (requires looping at the originator of the transaction), whereas SQL based apps assume pessimistic non-looping concurrency control.
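For anyone who hasn't used FDB, the originator-side loop looks roughly like this with the Python bindings (the @fdb.transactional decorator normally hides it; the keys and ASCII-integer values are made up, and both balances are assumed to exist already):

    import fdb

    fdb.api_version(630)
    db = fdb.open()

    def transfer(db, src, dst, amount):
        tr = db.create_transaction()
        while True:
            try:
                # Reads register conflict ranges; if another client commits a
                # conflicting write first, our commit() fails and we loop.
                src_balance = int(tr[src].wait())
                dst_balance = int(tr[dst].wait())
                tr[src] = str(src_balance - amount).encode()
                tr[dst] = str(dst_balance + amount).encode()
                tr.commit().wait()
                return
            except fdb.FDBError as e:
                # Re-raises if the error isn't retryable; otherwise backs off
                # and lets the loop run the whole body again from scratch.
                tr.on_error(e).wait()

An app written against pessimistic locking never has to write this loop, which is exactly the mismatch when you put SQL semantics on top.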
Yes, definitely, getting these things right is, I think, key to making the project actually work. Currently just experimental so we'll see!
Neon encounters a similar problem and performs prefetching of pages before Postgres wants to read them: https://neon.tech/docs/extensions/neon#prefetch-option
This is the interview question you would use to spot people that have actually used FDB vs just talked about it.
If anybody wants to follow along, I'll be publishing it here once ready: https://github.com/fabianlindfors/pgfdb (currently just an empty repo!)
I remember learning about FoundationDB a decade ago and being deeply impressed with what they built. Then it was acquired by Apple and went silent. Since then we've seen an explosion in new database storage layers. I'm curious is FoundationDB still the new hotness or has it been replaced by newer better technologies?
FoundationDB is pretty much the best distributed database out there, but it's more of a toolkit for building databases than a complete batteries-included database.
I found that once I took the time to learn about FoundationDB and think about how to best integrate with it, the toolkit concept makes a lot of sense. Most people instinctively expect a database interface with a certain level of abstraction, and while that is nice to work with, it does not provide the advantages of a deeper integration.
To take an example: FoundationDB itself has no indexing. It's a key-value store, but you get plenty of great tools for maintaining indices. That sounded strange to me, until I understood that now I can write my indexing functions in my app's language (Clojure in my case), using my model data. That is so much better than using a "database language" with a limited set of data types.
Incidentally, I think that using SQL with FoundationDB is a waste, and I would not write a new app this way. Why would I want to talk to my database through a thin straw that mixes data with in-band commands?
Since FoundationDB is hard to understand, there is (and will be) strong resistance to adoption. That's just how things are: we do not enjoy thinking too hard.
> Since FoundationDB is hard to understand, there is (and will be) strong resistance to adoption. That's just how things are: we do not enjoy thinking too hard.
More like: we all have limited time, and if it's hard to understand you are asking for a big upfront time investment for a thing that may not even be the best fit for your use case.
Anything can be made easier to understand with the right abstractions. The theory of relativity was super hard to understand when it was first developed; you basically had to be an elite physicist. But now non-physicists can understand it at a high level thanks to YouTubers like veritasium and minute physics. Maybe FoundationDB just needs better marketing.
Also: your description of FoundationDB reminds me of ZeroMQ, which basically just dumps MQ legos at your feet and tells you to build your own MQ system (as opposed to a batteries included solution like RabbitMQ)
Can we see some of your indexing code?
I'm not sure if you mean indexing implementation, or indexing of my model objects.
Indexing functions take model data and return:
* single index value which can be a UUID or a string
* single index value with multiple elements (e.g. a vector)
* multiple index values
They are Clojure functions, operating on Clojure data, testable with Clojure tests. As opposed to many other solutions, they do not live in the database, are not written in a limited database language with limited data types, and they operate on native (not coerced to db form) data values. Since they are functions of your data, you are not limited to indexing on a field or a combination of fields, you can compute a value (or values) based on your data.
The indexing implementation runs these when needed to get index values from model objects, and updates the actual index "tables" (areas in the key-value space, really).
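For a sense of the shape, a rough sketch in Python rather than Clojure (the key layout and field names are made up; the real code is app-specific):

    import fdb
    import fdb.tuple
    import json

    fdb.api_version(630)
    db = fdb.open()

    def email_index(user):
        # An "indexing function": plain application code over model data,
        # free to compute derived values rather than just copy a field.
        return [user['email'].lower()]

    @fdb.transactional
    def put_user(tr, user_id, new_user):
        key = fdb.tuple.pack(('user', user_id))
        old = tr[key].wait()  # bytes, or None if there is no previous version
        if old is not None:
            # Clear index entries derived from the old version of the record.
            for v in email_index(json.loads(old)):
                del tr[fdb.tuple.pack(('idx', 'email', v, user_id))]
        tr[key] = json.dumps(new_user).encode()
        for v in email_index(new_user):
            tr[fdb.tuple.pack(('idx', 'email', v, user_id))] = b''
        # Record and index entries commit atomically, so the index can't
        # drift out of sync on a partial failure.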
There's a library here that implements a lot of database features and can be used on top of any sorted transactional K/V store, with FoundationDB being an exemplar backend:
https://github.com/permazen/permazen
It's pretty sophisticated, but the author uses it in his own projects and then just open sources it without trying to build a community around it so you may have to dig in to see that. It gives object mapping, indexing, composite indexing, triggers, query building and so on.
It's not "hard" to implement this stuff per se but it's a lot of work, especially to build enough test coverage to be convincing. I used to be quite into this idea of FoundationDB layers and especially Permazen, which I think is easily one of the best such layers even though it's not well known. I even wrote a SQL binding using Calcite so you could query your Permazen object stores in FDB using SQL!
I will say though, that in the recent past I started a job at Oracle Labs where I ended up using their database in a project, and that kind of gave me a new perspective on all this stuff. For example: scaling. Like a lot of people who spend too much time on Hacker News I used to think Postgres was state of the art, that RDBMS didn't scale well by design, and if you wanted one that did you'd need to use something exotic like layers on a FoundationDB cluster. But no. FoundationDB scales up to a few hundred nodes at most, and Oracle RAC/ExaData clusters can scale up that far too. There are people storing data from particle accelerators in ExaData clusters. The difference is the latter is a full SQL database with all the features you need to build an app right there already, instead of requiring you to depend on questionably well maintained upper layers that are very feature-light.
One place this hits you immediately is joins. Build out an ExaData cluster and you can model your data naturally whilst joining to your heart's content. The DB has lots of ways that it optimizes complex queries, e.g. it pushes down predicates to the disk servers, it can read cached data directly out of other nodes' RAM over RDMA on a dedicated backbone network, and a whole lot more. Nearly every app requires complex queries, so this is a big deal. If you look at FoundationDB layers, then, well:
https://github.com/permazen/permazen/issues/31
Now in the last few years FoundationDB added support for a very, very simple kind of push-down predicate in which a storage server can dereference a key to form another key, but if you look closely (a) it's actually a layering violation in which the core system understands data formats used by the Record layer specifically, so it messes up their nice architecture, (b) the upper layers don't really support it anyway and (c) this is very, very far from the convenience or efficiency of a SQL join.
Another big problem I found with modeling real apps was the five second transaction timeout. This is not, as you might expect, a configurable value. It's hard-coded into the servers and clients. This turns into a hugely awkward limitation and routinely wrecks your application logic and forces you to implement very tricky concurrency algorithms inside your app, just to do basic tasks. For example, computing most reports over a large dataset does not work with FoundationDB because you can't get a consistent snapshot for more than five seconds! There are size limits on writes too. When I talked to the Permazen author about how he handled this, he told me he dumps his production database into an offline MySQL in order to do analytics queries. Well. This did cool my ardour for the idea somewhat.
There are nonetheless two big differences or advantages to FoundationDB. One is that Apple has generously made it open source, so it's free. If you're the kind of guy who is willing to self-support a self-hosted KV storage cluster without any backing from the team that makes it, this is a cost advantage. Most people aren't though so this is actually a downside because there's no company that will sell you a support contract, and your database is the core of the business so you don't want to take risks there usually. The second is it supports fully serializable transactions within that five second window, which Oracle doesn't. I used to think this was a killer advantage, and I still do love the simplicity of strict serializability, but the five second window largely kills off most of the benefits because the moment you even run the risk of going beyond it, you have to break up your transactions and lose all atomicity. It also requires care to achieve full idempotency. Regular read committed or snapshot isolation transactions offer a lower consistency level, but they can last as long as you need, don't require looping and in practice that's often easier to work with.
Thanks for the insight. A $XX million exa-data system is no doubt impressive :)
> Another big problem I found with modeling real apps was the five second transaction timeout. This is not, as you might expect, a configurable value. It's hard-coded into the servers and clients. This turns into a hugely awkward limitation and routinely wrecks your application logic and forces you to implement very tricky concurrency algorithms inside your app, just to do basic tasks. For example, computing most reports over a large dataset does not work with FoundationDB because you can't get a consistent snapshot for more than five seconds!
I'm pretty sure that the 5-second transaction timeout is configurable with a knob. You just need enough RAM to hold the key-range information for the transaction timeout period. Basically: throughput * transaction_time_limit <= RAM, since FDB enforces that isolation reconciliation runs in memory.
But the other reason that 5 seconds is the default is that e.g. 1-hour read/write transactions don't really make sense in the optimistic concurrency world. This is the downside of optimistic concurrency. The upside is that your system never gets blocked by badly behaved long-running transactions, which is a serious issue in real production systems.
Finally, I think that the current "Redwood" storage engine does allow long-lived read transactions, even though the original engine backing FDB didn't.
If it's configurable that's very new. It definitely has been hard coded and unconfigurable by design the entire history of the project that I've known it. They just tell you to work around it, not tweak it for your app.
Transactions holding locks for too long are indeed a problem, though in Oracle transactions can have priorities and steal each other's locks.
It looks like it might be knobs (which can be changed via config file) called MAX_WRITE_TRANSACTION_LIFE_VERSIONS and MAX_READ_TRANSACTION_LIFE_VERSIONS now (defined in ServerKnobs.cpp)? It's in microseconds and probably needs to be synced client and server.
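If memory serves, foundationdb.conf passes knob_-prefixed settings through to fdbserver as command-line options, so trying it would look something like this (entirely untested, the values are guesses, and presumably whatever launches the clients needs matching knobs too):

    # foundationdb.conf (sketch)
    [fdbserver]
    # roughly 30 seconds, if versions advance at the usual 1,000,000 per second
    knob_max_write_transaction_life_versions = 30000000
    knob_max_read_transaction_life_versions = 30000000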
I don't know the details now, but it was definitely configurable when I wrote it :) I remember arguing for setting it to a default of 30/60 seconds, but we decided against that as it would have impacted throughput at our default RAM budget. I thought it might have been a good tradeoff to get people going, thinking they could tune it down (or up the RAM) if they needed to scale up perf later.
Hah, well I'll defer to your knowledge then :) I don't remember seeing any kind of negotiation protocol between client and server to find out what the timeout is, which would be needed for it to be properly configurable. But it could be easily added I guess. I admit I never understood why changing it wasn't a proper feature.
> I started a job at Oracle Labs where I ended up using their database in a project, and that kind of gave me a new perspective on all this stuff.
I feel like a mandatory ~6 week boot camp using big boy SQL in a gigantic enterprise could go a long way to helping grow a lot of developers.
HN on average seems to have grave misconceptions regarding the true value of products like Oracle, MSSQL and DB2.
The whole reason you are paying money for these solutions is because you can't afford to screw with any of the other weird technical compromises. If you are using the RDBMS for a paid product and SQLite isn't a perfect fit, spending some money starts to make a lot of sense to me.
If the cost of a commercial SQL provider is too much for the margins to handle, I question if the business was ever viable to begin with.
Right. I should have done such a thing years ago. And there are free or elastically scaled commercial RDBMSes that are cheap. Oracle just gives them away these days in the cloud. Even if you self-host, you can run small databases (up to IIRC 12GB of data) for free, which is plenty for many internal apps.
The other thing that is super-unintuitive is the cost of cloud managed DBs. People like open source DBs because they see them as cheap and lacking lockin. I specced out what it costs to host a basic plain vanilla Postgres on AWS vs an Oracle DB on OCI at some point, and the latter was cheaper despite having far more features. Mostly because the Postgres didn't scale elastically and the Oracle DB did, plus AWS is just expensive. Well that was a shock. I'd always assumed a commercial RDBMS was a sort of expensive luxury, but once you decide you don't want to self-admin the DB the cost differences become irrelevant.
And then there's the lockin angle. Postgres doesn't do trademark enforcement so there's a lot of databases being advertised by cloud vendors as Postgres but are actually proprietary forks. The Jepsen test the other day was a wakeup call where an AWS "postgres" database didn't offer correct transactional isolation because of bugs in the forked code. If you're using cloud offerings you can end up depending on proprietary features or performance/scaling without realizing it.
So yeah. It's just all way more complex than I used to appreciate. FoundationDB is still a very nice piece of tech, and makes sense for the very specific iCloud business Apple use it for, but I'm not sure I'd try to use it if I was being paid to solve a problem fast.
> grave misconceptions regarding the true value of products like Oracle, MSSQL and DB2
In the context of this discussion, I would offer that we are getting into "apples vs oranges" comparisons here.
If you are doing custom queries for building reports where each report needs to access humongous amounts of data, SQL databases are likely a good fit.
If you need a fast and yet correct distributed database (for fault tolerance) for an online app backend, where the data and query patterns are known and do not change much over time and all retrieval is done using indexes, SQL databases are not a great fit, "big boy" or not.
As for "questioning if the business was ever viable to begin with", as a solo founder SaaS builder, I would humbly point you to numerous HN discussions where people are OUTRAGED at any subscriptions, and expect software to be an inexpensive one-time purchase. But that's a separate discussion.
> few hundred nodes at most, and Oracle RAC/ExaData clusters can scale up that far too
and how much license fee it will cost?
For over a hundred nodes? A lot, but these aren't little VMs with 4G of RAM and two vCPUs we are talking about here. Almost nobody has databases that big.
For a more realistic deployment in the cloud with 2TB of data, 4TB of backup storage and a peak of 64 cores with elastic scaling, it's about 6k/month. So the moment you're spending more than a third of a decently skilled SWE's time on working around database limitations, it's worth it. That's for a fully managed HA cluster with 99.95% uptime, rolling upgrades etc.
> fully managed HA cluster
This is where Suspension of Disbelief stops for me — I've been taught by years of Jepsen analyses that "HA" really doesn't exist, especially in the SQL database world, and especially if there is a big company behind with lots of big complicated buzzwords.
What you usually get is single master replication, and you get to pick up the pieces if the master dies.
Jepsen doesn't work through databases in any kind of order as far as I can tell, and they haven't done an analysis of Oracle or many other popular databases. So I wouldn't take that as a representative sample.
RAC clusters are multi-write-master and HA. The drivers know how to fail over between nodes, can continue sessions across a failover and this can work even across major version upgrades. The tech has existed for a long time already.
If you're curious you can read about it more in the docs.
"Application continuity" i.e. continuing a connection if a node dies without the application noticing:
https://docs.oracle.com/en/database/oracle/oracle-database/2...
Fast failover and load balancing for cluster clients (here in the context of Java but it works for other languages too):
https://docs.oracle.com/en/database/oracle/oracle-database/2...
I mean, I understand the skepticism because HN never covers this stuff. It's "enterprise" and not "startup" for whatever reason, unless Google do it. But Oracle has been selling databases into nearly every big company on earth for decades. Do you really think nobody has built an HA and horizontally scalable SQL based database? They have and people have been using it to run 24/7 businesses at scale for a long time.
> 2TB of data, 4TB of backup storage and peak of 64 cores with elastic scaling it's about 6k/month
Right, but FoundationDB and similar are for cases where you actually need the throughput of hundreds of servers; for what you described, OSS Postgres will work well.
It's confusing because of terminology differences. What FoundationDB calls a "server" is a single thread, often running on a commodity cloud VM. What Postgres calls a server is a multi-process instance on a single machine, often a commodity cloud VM. What ExaData calls a server is a dedicated bare metal machine with at least half a terabyte of RAM and a couple hundred CPU cores, with a gazillion processes running on it, linked to other cluster nodes by 100Gb networks. Often the machines are bigger.
Obviously then a workload that might require dozens or hundreds of "servers" in FoundationDB can fit on just one in an Oracle database. Hundreds of nodes in an ExaData cluster would mean tens of thousands of cores and hundreds of terabytes of RAM, not to even get into the disk sizes. The nodes in turn are linked by a dedicated high-bandwidth network, with user traffic being routed over the regular network.
As far as I know, nobody has ever scaled a FoundationDB cluster to tens of thousands of servers. So you can argue then that ExaData scales far better than FoundationDB does.
This makes sense for the design because in an RDBMS cluster the inputs and outputs are small. A little bit of SQL goes in, a small result set comes out (usually). But the data transfers and work required to compute that small result might be very large - in extremis, it may require reading the entire database from disk. Even with the best networks, bandwidth inside a computer is much higher than bandwidth between computers, so you want the biggest machines possible before spilling horizontally. You just get much better bang for buck that way.
In FoundationDB on the other hand nodes barely communicate at all, because it's up to the client to do most of the work. If you write a workload that requires reading the entire database it will just fail immediately because you'll hit the five second timeout. If you do it without caring about transactional consistency (a huge sacrifice), you'll probably end up reading the entire database over a regular commodity network before being able to process it at all - much slower.
> Postgres calls a server is a multi-process instance on a single machine, often a commodity cloud VM. What ExaData calls a server is a dedicated bare metal machine with at least half a terabyte of RAM and a couple hundred CPU cores, with a gazillion processes running on it, linked to other cluster nodes by 100Gb networks. Often the machines are bigger.
You can also have a PG cluster of beefy machines linked by a 100Gb network; I didn't get what the difference is in this case.
This is again an issue of terminology. A PG cluster is at best one write master and then some full read replicas. You don't necessarily need a fast interconnect for that, but your write traffic can never exceed the power of the one master, and storage isn't sharded but replicated (unless you host all the PG files on a SAN which would be slow).
A RAC cluster is multi-write master, and each node stores a subset of the data. You can then add as many read caches (not replicas) as you need, and they'll cache only the hot data blocks.
So they're not quite the same thing in that sense.
The PG ecosystem actually has multiple shared-nothing cluster implementations, for example www.citusdata.com, unlike RAC, where masters need to sync to accept writes, so technically write load is not distributed across servers.
While this isn't specifically for FoundationDB, I wrote this post many years ago about how to implement (secondary) indexes on top of a key-value store. [0]
https://misfra.me/2017/01/18/how-to-implement-secondary-inde...
Interesting that you chose to do it manually. Do you do this as a transaction so that the indexes are always updated? If not, how do you handle partial failures?
FoundationDB's original promise was to combine a distributed storage engine with stateless layers on top to expose a variety of useful data structures (including SQL). The company was acquired before it could release the layers part of that equation. Apple open-sourced the core storage engine a few years later so FDB has kind of had a second life since then.
In that second life, the vast majority of the powerful databases around the industry built on FoundationDB have been built by companies making their own custom layers that are not public. This release is cool because it's a rare case that a company that has built a non-trivial layer on top of FDB is letting that source code be seen.
The group to which the FoundationDB storage engine itself appeals is fairly narrow--you have to want to go deep enough to build your own database/datastore, but not so deep to want to start from scratch. But, for this group, there is still nothing like FoundationDB in the industry--a distributed ACID KV store of extreme performance and robustness. So, yes, it's still the hotness in that sense. (As others have mentioned, see e.g. Deepseek's recent reveal of their 3FS distributed filesystem which relies on FDB.)
AFAIK, the SQL layer was available and released
There have been some 3rd party toy projects in the past years, but pretty sure nothing was ever released by FoundationDB/Apple relating to SQL (until this additional SQL interface on the Apple "record layer").
2014: Beta launch of the SQL Layer: https://news.ycombinator.com/item?id=7593388
2014: Download SQL Layer: https://web.archive.org/web/20141209200002/https://foundatio...
2013: They talk about the SQL Layer: https://web.archive.org/web/20131220201001/https://foundatio...
2013: And someone can't start the SQL Layer on windows: https://web.archive.org/web/20131222063330/http://community....
Foundation is fundamental to iCloud at Apple, and is _something_ at Snowflake, among a few others. Recently DeepSeek used it for https://github.com/deepseek-ai/3FS "The Fire-Flyer File System (3FS) is a high-performance distributed file system designed to address the challenges of AI training and inference workloads."
I don't think that there's anything else quite the same, partly because it has some real oddities that manifest because of things like the transaction time limits. At Apple they worked around some of this with https://www.foundationdb.org/files/QuiCK.pdf
Tigris (an object storage provider, I have no affiliation) also uses FoundationDB for storing metadata:
https://www.tigrisdata.com/docs/concepts/architecture/#found...
Snowflake uses FoundationDB for their metadata store and...something else which isn't public.
ya here's the link explaining the metadata store: https://www.snowflake.com/en/blog/how-foundationdb-powers-sn...
Hmm? A secret weapon to slow down Databricks?
There is a company named Cognite, based in Oslo, Norway, delivering a data platform for Industry 4.0, that migrated from Google BigQuery to their own database on top of FoundationDB.
The video about it is available here: https://2023.javazone.no/program/85eae038-49b5-4f32-83c6-077... After watching, my thoughts were: why didn't you just use Clickhouse?
It’s now been approximately 7 years since Apple open sourced FoundationDB. Note that it was closed source before the acquisition, which is often not appreciated.
I think that was more like 1.5 decades by now :). Yeah, I was super enthusiastic about it. Seemed perfect. Back then I considered Riak, MongoDB, and things like Tokyo Cabinet.
FoundationDB to me is like Duke Nukem Forever.
I don’t need it anymore. At least not for now
There is one more project that aims to build a MongoDB-like query engine and uses Redis wire protocol. It's Kronotop: https://github.com/kronotop/kronotop
Kronotop uses FoundationDB as a metadata store for document indexes and the other stuff. It stores the document bodies on the local disk and supports primary-follower replication.
It also works as a RESP3/RESP2 proxy for FoundationDB API.
My favorite FoundationDB layer is per-user SQLite databases: https://github.com/losfair/mvsqlite
It's hard to tell if it's running in production, but the author works at Deno!
There was some discussion earlier on another thread about the one-SQLite-DB-per-vendor infra architecture; can't remember, maybe on the DuckDB one?
At some point someone will reimplement the DynamoDB API on top of FoundationDB. That'll be nice because then you'd have an effectively cheap hosted version available.
At last FoundationDB has a SQL layer. AFAIK the initial discussion was in 2018 [1]
[1] SQL layer in FoundationDB, https://forums.foundationdb.org/t/sql-layer-in-foundationdb/...
So how will this round improve on the previous design that was quite slow?
> FDB-SQL was less than half as fast as MySQL on a single machine.
https://www.voltactivedata.com/blog/2015/04/foundationdbs-le...
I really, really want Node.js bindings for the FoundationDB record layer. I tried using a Node-Java bridge, and it could be made to work, but it'd be quite an effort to maintain, I guess...
Shouldn't be too hard. I built an Erlang/BeamVM driver/wrapper for it [1] before it got acquired by Apple... Their API is nice and clean.
[1] https://github.com/happypancake/fdb-erlang
Plain FoundationDB and the document layer have bindings. It's the record layer that's a bit more complex, with indexes, queries, etc.
FoundationDB is very cool, but I wish it didn't require linking in their C library to talk to it. The client story is not good.
Theoretically you could write your own client library, but this is nontrivial — this is a distributed database. The client library talks to multiple servers. It's not a "connect to a socket and send commands" type of client library, like in the case of SQL servers.
The hard part is that there is no client spec you can follow as a third-party. Everything is implementation-defined. If you're out-of-tree, your code can break at any time. If the FoundationDB project committed to a protocol, client authors could write libraries outside of the main project.
Can you though? The protocol is not very well documented and it seems to iterate rather rapidly with the server version that it aims to be compatible with.
Are you planning to?
This seems to be a theoretical discussion: I don't think I'd ever want to implement the client part of FoundationDB myself, and I don't really see a good reason to.
You might be able to, but are definitely not supposed to. The client is conceptually "part of the cluster".
Java source at https://github.com/FoundationDB/fdb-record-layer
been watching foundationdb for ages and it's kinda crazy it's still holding up while new stuff keeps dropping. always makes me wonder what it takes to keep something useful for so long
The best technologies aren't always the most fashionable technologies — FoundationDB is very good, but it isn't flashy or fancy and the concepts that it addresses are hard to understand, so it isn't that popular. It also doesn't help that FoundationDB is really a toolkit (foundation), rather than a complete "here's my objects, store them" database.
With the “Record Layer” you get that “here’s my objects, store them”. Unfortunately, that layer is Java, so it’s not out of the box if that’s not your language. It’d be nice if they could provide a gRPC API for the Record Layer.
It really says something when the Aphyr guy doesn't even want to test FoundationDB because their testing suite is vastly superior and comprehensive: https://news.ycombinator.com/item?id=39358448
here's some insight into the team's thought processes around the development and testing of FoundationDB. A few went on to form Antithesis: https://antithesis.com/company/backstory/
There's an intriguing project which puts SQLite on top of FoundationDB; unfortunately the dev seems to have moved on from that effort:
https://github.com/losfair/mvsqlite
If you are storing rows as records in FDB I am very skeptical about the performance. It seems to me it would be quite poor, because of the latency. You are talking to the network to get every row.
I guess it would be scalable. You could execute lots of concurrent queries, but the actual performance of every non trivial query would be poor compared to a regular SQL DB.
FDB seems to outperform, and out-scale, most if not all?
According to an old report, FDB can do around 75k transactions per core [0].
MySQL on same CPU (all cores) can do about 8k [1].
[0] https://apple.github.io/foundationdb/performance.html
[1] https://www.anandtech.com/show/8357/exploring-the-low-end-an...
These sorts of numbers need very careful understanding and analysis, otherwise they can be misleading. Measuring FoundationDB transactions against an RDBMS's is comparing apples to oranges.
A FoundationDB transaction is a vastly less powerful thing than an RDBMS transaction. In particular, an operation that would be one transaction in an RDBMS must often be split into several transactions in FoundationDB due to its timeout and size limits. An RDBMS allows a client to take thinking time for as long as it needs whilst working on a transaction. FoundationDB stores its undo logs in RAM rather than on disk. This lets them advertise great latencies but is the cause of the five second timeout. They actually advertise this as an "anti feature":
https://apple.github.io/foundationdb/anti-features.html
And KV stores suffer the N+1 query problem once you go over the network: traversing links between objects/joins requires round-trips through the client unless you have a very smart record layer and only do simple traversals, which can immediately kill your latency. But in FoundationDB higher latency = smaller numbers of possible operations before your transaction expires and has to be retried from scratch. Thus you can get into a situation where the cluster slowing down due to network congestion can cause death spirals in which transactions start dying because they ran out of time, get retried by the client and this places more load on the cluster which then slows down even more, etc. Whereas in an RDBMS the database slowing down causes backpressure to the clients that is ultimately propagated to the user, avoiding this kind of problem.
So... database benchmarks. Complex thing. Be careful out there! See my other comments on this thread for disclosures and more discussion of RDBMS vs FoundationDB.
Which is why I said it probably scales well. But you cannot equate that with per-query performance. Accessing rows over the network is a few orders of magnitude slower than local NVMe disk with big RAM caches.