Wow, nice explanation. I was expecting a lengthy post with refuting everything uber said and explaining why pgsql was a better choice and uber was wrong.
Nice and clean post showing that acknowledging a weakness isn't a terrible choice.
I do wonder whether a different data structure would have mitigated the issue instead of transitioning to a different storage engine.
For the kind of stuff Uber stores, they may actually be doing it wrong (given what that Postgre mailing list post says) because that is one hell of an ugly use case for any DB.
I would have tried solving it by loading a dual E5v4 server full of 3TB and a slew of SSDs for L2ARC+ZIL under ZFS: more SSDs > bigger SSDs because the absolute worst case SSD performance that any and all SSDs suffer from is random reads (not writes) can slow to 200MB/sec even on those insanely fast "enterprise" PCI-E SSDs that do 2-4GB/sec continuous reads.
Given that, there are only four dbs worth using: Postgre, Oracle, DB2, and MS SQL. Unless you're doing something that is truly not suited for SQL (or SQL is insanely overkill), or you're Google and have a database in the hundreds or thousands of TB and literally have to write your own database (they went from almost inventing mapreduce, to going full circle back to full scale distributed RDBMS SQL with F1, which I wish they'd open source), any other database is just going to be a pain in the ass, buggy, and full of gotchas and undiscovered corner cases simply because it doesn't have over a decade of development and millions of users behind it.
Also, I have not included MySQL for obvious reasons: fun for toy SQL DBs where sqlite isn't a good fit, but not for enterprise use by any means. Uber switching to MySQL over Postgre is rather scary, I wouldn't want to be a Uber investor right now.
You wouldn't wanna be an Uber investor because choice of DBMS? As a programmer, I think this is the problem with developers thinking that technical problems are a bigger deal than it is.
Uber relies entirely on their database. If the database is slow, or cannot be considered reliable under extreme load, or can lose data in ways that are hard to recover, then Uber can potentially lose a lot of money especially during peak hours.
Yes, I'm aware there are systems for MySQL to handle failure, but I'm also aware of the systems for Postgre, and Postgre's failure handling seem to be far saner and easier to recover from. Defense in depth against failure is easier in Postgre from my experience.
This would be like Elon Musk blogging about how "oh yeah, sometimes the brand new Tesla factory shuts down completely because sometimes the power goes out; but we're using really popular well known power distribution systems, so we're industry compliant, so everything is okay."
If Elon Musk blogged that, HN would go apeshit, and rightfully so.
Postgreql is trustworth and predictable and engineered and engineerable. Its like a German union automobile plant press operator sitting down on the job and crossing his arms until the broken safety switch is fixed, which will take precisely 3.25 hours and cost $X while the resulting assembly line shutdown costs 1000 x $X. But it'll be safe and nobody gonna lose an arm. Your downtime and related costs are more or less predictable. Maybe not the highest productivity plant in the division, but nice safety record.
Mysql is best effort. The safety switch on the press breaks, redlining that machine and shutting down the entire plant. Hmm if I stick my arm in that 50 ton press while its operating, that'll hurt a bit, so lets just not do that. Dude's a real tryhard, which always ends like you'd expect. Of course safety regulations were literally written in blood so at some unpredictable time in the future you'll get a $1M personal injury lawsuit for loss of an arm and a $10M OSHA fine, and the plant will be shut down for the criminal investigation for a random indeterminate amount of time plus the interval required to remove arm from press. Your downtime and costs are completely unpredictable, but probably mostly over a very long term for many people on average lower than postgresql. The plant will have a higher productivity metric result, and also a worse safety record.
However there is an important point that philosophy doesn't matter when times are good. Its only when the tool is misused or there's a malfunction that the underlying philosophy even shows up.
It doesn't matter which system you use when you try to store Aug 1 2016 into a date column, but (at least in the old days) it was very interesting trying to store February 30th into the databases. Insert anyway with a warning? Round up, down, or stick in a null? Normalize it to being March 2nd ish? Insert fails completely with an error? This has varies with time and configuration but in a "general sweep of history" manner you can guess correctly most of the time what each DB does.
Also there's nothing wrong in any way with a critical system that drops into philosophical best effort mode during a crisis rather than paralytic halt mode. Well, there's nothing wrong with it as long as the system was engineered with that in mind and neither the dev nor ops people are surprised by that behavior. Sometimes that is the right thing to do.
Here is an Uber engineer's talk about their worst outage ever. 16 hours of downtime for their API as they repeatedly try and fail to promote a new postgresql master and reparent slaves to it.
the problem mostly happened cause ppl just ignored the "running out of disk space" message. Would happen on mysql, too. And I really wouldn't want that to happen on galera, I guess that would be a way bigger desaster.
Mysql eats and corrupts data by design. For a company responding to real time events in physical world, that can be a big issue. I know they're trying to improve their defaults lately, but a lot of weird behaviour remains. And you don't have to be an expert DBA to know that choosing a technology known for silent data corruption is risky.
The design of MySQL has a lot more silent failures, silent coercing of data, and other ways that it attempts to do what it thinks you might mean (because you're an incompetent PHP programmer) instead of what you ask. The obvious example is that SELECT 0 = 'banana' returns 1.
A typical takedown would be the likes of: http://grimoire.ca/mysql/choose-something-else (which also touches storage engine configuration things that are easier to defend against by an experienced organization). Unfortunately this sort of takedown solves very few discussions.
You can't configure MySQL to not do "any" of that. You can certainly make it better, but there simply aren't options to configure away all of the boneheadedness.
There are also tons of hidden gotchas that exist in, for example, the query planner. It can be extremely fickle and suddenly switch from a performant query plan to a terrible one that creates unindexed temporary tables and sorts them or joins against them. Or just ignores relevant indexes in the tables whatsoever.
Everything hums along fine until a random INSERT or UPDATE causes the query plan to change, bringing down your entire site. To be fair, such a problem can happen in any DBMS but I've never experienced it with Postgres to the extent that I have with MySQL.
You're thinking there's databases out there that are flawless, that never corrupt data, but that's garbage. They all do to a degree. They're also subject to being corrupted by hardware failures that aren't related to software.
Anyone with a huge production database running under load is going to have ways of mitigating these problems. Tumblr manages with MySQL, they open-sourced some of their tools like JetPants (https://github.com/tumblr/jetpants) to help build huge datasets.
So maybe Uber made a call and said "we can deal with intermittent corruption problems, we can recover from those, so long as the performance is better because a reputation for being slow is something we can't recover from". Life is all about trade-offs.
Nothing in my comment implies anything close to thinking that there are databases that are don't or flawless or don't corrupt data. I'm not sure how a comment about poor query optimization could possibly be interpreted that way.
Regardless, MySQL by default silently eats data in common situations (truncation of VARCHAR) and returns flat-out incorrect results due to PHP-style "helpful" coercions (SELECT 0 == "banana"). It implements UTF-8 incorrectly, but fixing it would break existing apps, so we're forever stuck with "utf8" encoding that isn't.
There are a million more of these, and while some of them have workarounds (strict tables, utf8mb4), many of them don't (automatic coercion, boneheaded query planner, creating implicit temporary tables without indexes even when present, etc.).
A comparison of MySQL to PHP is apt, honestly. The fact that PHP is a blight doesn't mean other languages don't have their own problems. But PHP (like MySQL) is in a league of its own here.
You can Google for silent truncation for a quick example. To be fair, MySql > 5.6 has fixed some of these issues and it also has some flags that can be set to help prevent them.
The by design part is referring to early versions of mysql and discussions around it purposely did not care about ACID. Speed was the number one driver.
I was recently advised by a DB consultant whose area of expertise is MySQL that 5.7 is still too new and risky and that he would advise against upgrading for at least another 6 months or more. He feels that the releases come out much, much too unstable and unpolished and that it typically takes at least a year since release before he's comfortable running it in production. I don't know enough about MySQL to know if that's true or not.
We are now investigating switching to MariaDB instead. (I'd personally love to move to Postgres, but that's not likely to happen any time soon)
I can confirm that the query optimizer introduced a rather serious bug (significantly suboptimal plan for queries involving low cardinality indices), which caused serious issues in our system.
This, in addition to the fact that index merging has been broken in MySQL 5.6 for more than an year now (in some cases it will cause empty resultsets to be returned), and that it is still broken on MySQL 5.7
Beware slight ddl incompatibilities. For example Maria will dump timestamp field size, which mysql doesn't understand (or was it the other way around?...)
"The default SQL mode in MySQL 5.7 includes these modes: ONLY_FULL_GROUP_BY, STRICT_TRANS_TABLES, NO_ZERO_IN_DATE, NO_ZERO_DATE, ERROR_FOR_DIVISION_BY_ZERO, NO_AUTO_CREATE_USER, and NO_ENGINE_SUBSTITUTION."
> Also, I have not included MySQL for obvious reasons: fun for toy SQL DBs where sqlite isn't a good fit, but not for enterprise use by any means. Uber switching to MySQL over Postgre is rather scary, I wouldn't want to be a Uber investor right now.
I've been running a few MySQL setups for a decade, why do you think it is not a good fit for enterprise apps? InnoDB solves most of the previous limitations. I also think MySQL is slightly easier to deploy/scale than Postgres. I believe Facebook is still running MySQL for instance.
... or you're Google and have a database in the hundreds or thousands of TB and literally have to write your own database (they went from almost inventing mapreduce, to going full circle back to full scale distributed RDBMS SQL with F1, which I wish they'd open source),
You may be interested in CockroachDB, which was inspired by Google's F1 and Spanner implementations.
No, it was not. FoundationDB's SQL layer was a joke. The system was horribly slow. Every DBA that I talked to that tried it said that it did not deserve the hype that it got.
I actually know engineers who work for Google for that. Google maintains their own MySQL fork internally and also know exactly how MySQL fucks up and has planned for MySQL failure in depth.
Google is maybe the only company I trust to know what they're doing when it comes to databases.
You realize that Google, Facebook, Yahoo, and many, many other shops have successfully run on MySQL for well over a decade, right?
You can armchair engineer all you like, but at the end of the day, the proof of the engineering is in the working.
And if you have Uber stock you'd like to sell just because you think they're making a questionable engineering decision around databases (despite all the real world evidence that it is adequate to serve large scale businesses), I know a few investors who would gladly take it off your hands.
Github works on MySQL and they got pretty good at it. Ubers engineers made a respectable decision by switching to a lower-hyped database purely on excellent engineering.
Mitigating against your database going mental is scary and should not be encouraged but it's the default in most MySQL shops.
Whether you /can/ coerce it into doing the right thing is not the issue, Uber are doing their own thing regardless of if its right anyway so they're unlikely to change in that regard- what makes you sure they will do MySQL right at all?
It's true that the large majority of FB's mysql fleet is dedicated to the main product data, essentially an object-graph store with a restricted access pattern. The queries are indeed relatively simple. I can't remember for certain but I think that simple joins and transactions are actually used there.
However, the rest of the fleet supports an extremely diverse set of workloads -- keep in mind that MySQL is the primary data store of Facebook, and this includes product, ad serving, payments, async task persistence, internal tooling, many many other things. Countless different query patterns are in use. And although this is a minority of FB's mysql fleet, it's still many many thousands of machines supporting these other workloads, substantially larger than the vast majority of Postgres installations in the world.
(Source: I worked on MySQL automation at FB, and was lead dev on their RDS-like DBaaS, used by a large portion of the company's engineering teams in one way or another.)
I will readily admit that Postgres is a very good database, and definitely a better choice for OLAP workloads than MySQL (which has problems with complex queries) / InnoDB (problems with long-running read queries impacting old-row purge behavior). However, for extremely high-volume, large-scale OLTP workloads, MySQL/InnoDB is absolutely an excellent choice, imo better than Postgres for reasons of performance, ecosystem, and number experienced engineers who have worked at extreme scale.
Google's hosted CloudSQL (a hosted &customized MySQL solution) had tons of connection drop issues. Last incident it took them 72+hrs to resolve and they had no idea what and why caused it (the resolution was a side effect from another issue that resolved for some other custom complains.)
This was my first thought, and generally what I always think when I hear about update heavy use cases. Typically they represent some issue in the object model, and often driven by an ORM layer.
I always ask is there a need for absolute correctness, and would this be better handled by a batch processing method.
I'm not sure any hardware would have solved their issue though. It may not have actually been disk speed, but issues in the mvcc design itself.
Maybe I've been in big companies for too long, but why not Teradata? Yeah it's expensive for the amount of storage Uber would need, but it's going to scale for them and perform well.
> Given that, there are only four dbs worth using: Postgre, Oracle, DB2, and MS SQL
This is total bullshit. Even a modest size database (3-5 terabyte) for a small company like mine, chokes a relational database bigtime on time series. Especially the modern use cases where you are ingesting fast. You need Cassandra or Hbase which are very serious pieces of technology and are definitely "worth using", and will crush any of the four relational dinosaurs you mention in that space at comparable cost, and with embarrassingly better scalability. And by the way, that space is not some corner case, it is probably the major growth area of the next computing era.
There are column-oriented relational databases for that. DB2 and SQL Server both implement columnar systems for OLAP-esque workloads pretty well, and I think Oracle has something like that as well. In the Free software space, MonetDB is pretty good and I believe there's some Postgres extension to store data column-wise (which is probably not competitive with ground-up column stores like MonetDB, but beats row-wise storage for certain workloads).
You don't need to give up proven relational databases for this web-scale bullshit. Financial companies have been doing hundreds of times “modern use cases where you are ingesting fast” for decades. KDB+ is the darling of that industry, but DB2 is pretty popular (IIRC) as well as some other niche options.
Right tool for the right job, basically. I have nothing against Cassandra or HBase, and they're both quite impressive pieces of engineering. But they're not an alternative to relational databases; rather, a complement for very niche use-cases. I see NoSQL stuff get used for “performance and scalability” in places where a columnar relational DB would scale just fine and bring the benefits of a relational database. I'm all for HBase for non-relational use-cases, but those are rather more rare than people seem to think (and end up with an ad hoc relational model because of it).
All credit belongs to my peers in the Big Data division, and the engineering directors who successfully argued to opensource our entire portfolio of data tech (Greenplum, HAWQ, Gemfire and MADLib).
who said they were an alternative to relational databases? The problem with the post was not that the 4 mentioned were bad, only that they were the "only four worth using" which is an eye-rollingly ridiculous statement. Anybody who asserts that relational databases are the only serious databases is plain wrong, or has an ulterior motive. And this is not for "web scale bullshit" by the way. It is about large amounts of financial timeseries ingest for machine learning optimizations that I can promise you, has nothing to do with the web.
This is one thing I like about HN's system: it lets me edit posts for a limited period. I'll sometimes bang out a post like the above, with a rather negative opening line like that and hit "submit". Then a minute later I'll realize I need to tone that down some, so I'll go edit it, as you suggest here. Unfortunately some shitty web forums don't allow you to edit posts at all, but I do appreciate ones which do allow editing stuff like this before anyone has a chance to read it.
HN also lets you set a delay before a comment appears, if you're like me and tend to catch mistakes more easily on the page than in the composition box. It's called "delay" on your settings page.
Yeah, but generally only when you're actually correct. If you're not, you look like an idiot. Apparently the parent hasn't heard of column-store relational databases (which are kinda great for time series).
(I'm most familiar with SQL Server; DB2 and Oracle apparently have similar functionality.)
Apparently their 3-5TB time series table chokes ‘dinosaur’ relational databases. For a time series table you should almost certain be using a column-store index with CREATE CLUSTERED COLUMNSTORE INDEX. Depending on how much they're querying vs inserting, that alone could prevent SQL Server from choking. If it's more insertion-heavy, but they still need to run a lot of ad-hoc queries, SQL Server 2016 supports using a nonclustered column-store index alongside a row-store. You can insert with the row-store and query with the column-store. If it's still choking with that, it might be time to check out the database structure and see what's up. One of the more common killers is putting a lot of data in the time series table. Generally it's more efficient to have a clustered column-store index table that contains the timestamp and ids to metadata, and keep the metadata itself in row-store tables with appropriate indexes.
of course I've heard of column store relational databases, but they're not nearly as efficient nor scalable as Cassandra which has been built from the ground up for this. Fact is, if you don't need referential integrity, there is absolutely no need to pay the massive complexity overhead and horizontal scalability penalties that the relational databases incur.
Separately it is absolutely correct that a comment that asserts that 4 relational databases are the only serious ones is completely misguided and deserves a robust rebuke.
Technology moves fast. Very fast. If a piece of tech gets old enough to be called a dinosaur, and is still run at brand-new companies, it's doing something right, even if it's not for your usecase.
And the fact is, not all of us have 3-5TB time seriesdatabases. And time series data can cause a lot of DBs to choke.
The point is, I'm glad Cassandra and Hbase, or a dedicated TSDB, or whatever, work for you. Your usecase is not the same as the rest of us, and assuming that it is is causing you to come to some incorrect conclusions.
Clearly the relational databases are very good and very useful in more cases than time series. However timeseries is huge, and growing - it's not a niche case anymore. Anyhow my beef was with the assertion that the 4 relational databases listed were the "only serious ones". This is completely untrue.
lol right, with MongoDb, Map Reduce is a joke, GridFS is slow and barely usable, the storage is extremely inefficient, the "query engine" slow, and don't get me started on their "full text search" engine. MongoDb is a successful marketing stunt in the "Nodejs era".
Also, wasn't there a ridiculous issue that they had where the db can't be bigger than 4gb on a 32 bit file system because that's the largest size a file can have...?
Yes, apparently the limit on 32-bit is 2GB actually. MongoDB has always stated upfront that 32-bit architectures are not recommended for production use for precisely this reason: http://blog.mongodb.org/post/137788967/32-bit-limitations
This has also has been stated on their download page for 32-bit binaries as well.
It hasn't always been stated there, and it was a warning on a README somewhere. The server would run happily on 32 bits, and never crash or produce any user-visible errors, it would just silently corrupt data while pretending the insertion went great.
And that's the real issue I would think. If it shut itself down or went into read-only and generated errors it wouldn't be so bad. The limitations 32-bit architectures are why we have 64bit architectures. I think it's fair as a developer to require x64 instead of jumping through hoops to support both architectures. But if you're going to release 32-bit binaries for something like development or testing you should be explicit about the use cases and limitations and avoid failing silently when they're exceeded.
This stems from their choice of "mmap and done" as their entire IO strategy, for earlier versions. The data loss is silly though and shows their attitude.
Oracle 9 on 32 bit can handle 32GiB databases without any issue, and bigger setup to use more that a single file per table space. And Oracle 9 it's old.
Note: I hate Oracle DB, but I must work with Oracle 9/10/11/2 DBs because is what our clients have.
> When running a 32-bit build of MongoDB, the total storage size for the server, including data and indexes, is 2 gigabytes. For this reason, do not deploy MongoDB to production on 32-bit machines.
DB exist on more than servers, e.g. in an embedded environment... You find sqlite in the oddest places... One could imagine finding mongodb there too, with bonus data corruption...
Umm... Wow, mongo. Boy, am I glad I didn't decide to pick you.
Let's get some statistics. MySQL and Postgres both have and undefined max DB size, and max table sizes upwards of 16 and 32 TB, respectively.
SQLite has no max table size I could find, but had a max DB size of 140 TB. 140TB. And this is from a database that reccomends not using it if your data grows too large.
Nice and clean post showing that acknowledging a weakness isn't a terrible choice.
I do wonder whether a different data structure would have mitigated the issue instead of transitioning to a different storage engine.