Designing a Database: What You Need to Know
Let’s start with the bad news: There’s no one perfect database. Every database format, from MySQL and Postgres to MongoDB, has different strengths and weaknesses. Or, to put it differently, every database’s strengths comes at a price.
“Just like everything else with computers,” Asya Kamsky, principal engineer at MongoDB, told Built In, “nothing is free.”
The software development adage, she said, applies to databases too: good, cheap and fast — pick two.
“Just like everything else with computers, nothing is free.”
“If you have a lot of data to store and you want to get all of it back quickly,” Kamsky elaborated, “either it’s going to cost you a lot of money or you’re not going to get all of it back.”
The good news is, you probably don’t need the whole trifecta. Different database use cases forgive different weaknesses.
Take Twitter, Kamsky explained. When you open the app, it’s not imperative that you see each and every recent tweet from the people you follow in a precise order. Accuracy isn’t key; you just have to open the app and see something. Twitter needs “fast,” but could opt out of “good.”
For Amazon, on the other hand, “good” is essential. When you browse your order history, you need to see each and every thing you ordered, timestamped and with an accurate delivery status. Luckily, this information doesn’t need to load in milliseconds, so Amazon can opt out of “fast.”
So, if you’re designing a database, don’t be a perfectionist, and do think in tradeoffs. It’s more complicated than that though. We spoke to three experts, including Kamsky, about the do’s and don’ts of database design.
What is Database Design?
- Calvin French-Owen
Chief technology officer at Segment
- Asya Kamsky
Principal engineer at MongoDB
- Alexander Sorokin
Lead architect at Airtable
The Do’s and Don’ts
Do Have a Plan...
Kamsky: It helps to figure out: How big of a service overall am I going to need? Like how many users am I going to have? How quickly do different pages need to respond? If it’s a website or if it’s a mobile app, how fast do different functions need to be? The database is almost an afterthought. It all starts with correctly thinking about what’s important for my application to succeed.
Sorokin: With database architecture, you have to really understand the data that you’re going to be putting in. Sometimes it’s hard if you don’t have real use cases or data. You have to be somewhat imaginative. You need to think: What are the data items that we’re putting in, what kind of queries would we like to be able to execute against this data, and how much data is going to be in this database? What are the attributes of the data? How quickly will they change? If the attributes change 10 times a day, you need to think about your data differently than if they change once a month.
... but Don’t Get Ahead of Yourself
French-Owen: You should never take the advice of a Google or Amazon or Facebook and try to adopt that prematurely. I think it’s really tempting to try and scale like them, but a lot of times, Google is designing databases for problems no one else has. They’re designing to analyze literally the entire text of every book that has ever been written. A lot of other companies don’t reach anywhere near that level. Design for the scale that you have today, thinking about the scale you’ll have tomorrow. Don’t jump on the bandwagon of picking up pieces of technology just because you’ve heard that the big players are using them.
Kamsky: What’s amazed me is how many times you ask someone who has, at peak, maybe 5,000 users, 10,000 users, “How high do you want this database to scale?” And they go: “What do you mean? As high as possible!”
Well, if you had infinite money you could do that. But realistically, no.
Do Consider Non-Relational Databases ...
French-Owen: Non-relational databases started to gain a lot of popularity about seven years ago, when Segment was just getting started. Really big companies like Google, Facebook and Amazon were all talking about how they were hitting the limits of their current relational databases, like Postgres for MySQL. They’re trying to scale to basically serve the whole internet. In order to do that, they had to create this new kind of database, a non-relational database. Mongo was kind of the poster child of those databases — instead of having to deal with all of these rules and this complicated processing inside a single database instance, it made it so you just really quickly could get back data in a more flexible format. It meant that you could scale a lot more easily.
Sorokin: I think generally, the biggest choice you need to make is whether you need a relational database, like a SQL database, or a NoSQL database. Most companies need both. The difference is roughly that relational databases allow you to query the data across records, and non-relational databases generally let you write and read much more efficiently but don’t provide you as much query ability across all of the data. If you’re in the early stages of development, and you don’t know the schema of your data, it’s easier to build a NoSQL database engine that scales well.
... but Don’t Underestimate Relational Databases
French-Owen: We’ve actually replaced a lot of our non-relational databases with relational databases over the years. The big thing that relational databases have going for them is their schema — a set of rules that defines what your data looks like. This is really important when it comes to updating multiple types of data at once.
For example, let’s say you have a set of data related to your users, and then a set of data related to something they own, like posts or comments. And you want to be able to say, “When a user deletes their account, remove all of the posts and all the comments that they made too.” If you have a relational database — a database where you’ve said, this comment belongs to this user, this post belongs to this user — you can automatically get rid of all of that content. If you have a non-relational database, you have to do all of that yourself, and it’s easier as a company to ignore it. It creates all sorts of lax areas where data’s just kind of hanging around.
Sorokin: Relational databases simplify coding for application developers. They’re not as flexible, but if you have to deal with the possibility that there are 50 fields on the user in a non-relational database, and every user may have filled out any combination of those, the code becomes much more complicated. So it’s actually beneficial to be rigorous and strict about the schema of the data, because if you can rely that on the fact that every user has a first name, every user has a login name, it makes your code easier.
Do Create Indexes to Boost Query Speed
Kamsky: Sometimes, clients set up their own databases, and then they come to us and say everything was working great, but now queries are slow. Or when they try to load this data, it takes a lot longer than it used to. We say something like, “What indexes have you created?” And they go: “What? What’s an index?”
Indexes are these data structures that tell the database, “I’m going to be querying by this attribute and I want it to be fast.” Indexes are not free. They make your reads faster, but they themselves consume resources and they make the writes a little bit slower. But if you don’t tell the database what you want, how’s it supposed to know?
Sorokin: The index in a database is essentially the same thing as a library index or a book index. You have some things that are ordered alphabetically. If you know the term you’re looking for, then you can very efficiently find it. But when you send a query to the database and you don’t tell it to use an index, or occasionally there is no index — it’s like if you have a book with no index, and you have to scroll through the pages. If the book is small, that’s not a problem, right? But at 5,000 pages, it becomes a problem. It’s gonna go slower.
Do Use Transactions, If You Can
Sorokin: A transaction is multiple changes that happen as one change. We use them a lot at Airtable — if you can use them, you should use them and really understand them. They’re extremely valuable. Generally, relational databases have transaction support and non-relational databases don’t.
The transaction model — the details of how the transaction actually operates — varies, though. For example, there’s a concept of “dirty reads.” This basically means you can see a draft before it goes live. Let’s imagine that we are changing the title of a blog post — the dirty read model says, if anybody wants to read what the title is right now, they will see the latest revision, regardless of whether the transaction has committed. Some databases allow dirty reads, some don’t. At Airtable, we don’t. We generally design for as much clarity as possible. But to use transactions effectively, you need to understand the transaction model.
Do Consider Pairing Your Database With a Data Warehouse
French-Owen: Fundamentally, databases and data warehouses do similar things. The way you can think of a database is that it’s used for more online work. So if I need to load up a user’s profile or a webpage, or get a response within a handful of milliseconds, the database works great. On the flip side, if I have a lot of data, terabytes or petabytes worth of data, it might be too big to fit in a database. So that’s typically where a data warehouse comes into play. It allows you to store more data, and typically you’ll get answers back within like 20 to 30 seconds.
I think we’re seeing a little bit more of a data Renaissance recently around data warehouses. Google and Amazon have really changed the game and this aspect by offering their own data warehouses that you can literally rent for $4 per hour. You can ask these deep custom questions, and they’ll just take care of all of the management for you at like one thousandth of the cost it used to be. That, paired with this kind of explosion of data, has allowed people to ask many more questions much more cheaply.
We see a bunch of companies using a database for the critical source data that they need from an operating perspective — like, “When this user loads up their timeline, they should see this set of posts.” Then they’ll split that off from a data warehouse where they’re keeping a bunch of less important, analytical data, like “This user visited like these sets of pages” or “This user filed X number of help desk requests.” It’s stuff that can be very useful when it comes to designing a better product, but you don’t need it to power logging in.
Don’t Rely on Just One Database, Necessarily ...
Kamsky: You don’t necessarily use only one database, right? It’s absolutely possible that you go, “For this use case I need one type of data store, but then for other analytics I need a different database.” This is why we have a lot of ways of moving the data and transforming the data between different systems.
The old way was to put everything into a single large database that was the system of record, and then every application had to figure out how to use that data. In reality, a lot of the applications would be like, “We’ll just copy the data over and transform it and store it differently, in a way that our application needs.” Because different applications may be using the data very differently.
It’s easier to scale with multiple databases too. If you have two different applications and they’re both using the same database, and one of the applications suddenly needs a lot more bandwidth, a lot more computing power, the database is going to get slower and it’ll make the other application slower too. Whereas yeah, if each application has its own database, it makes it easier to monitor the performance of each one.
... but Do Prioritize Simplicity
Sorokin: You should absolutely try to use as few databases as possible, especially when you’re starting out. Try to use as few database technologies as possible too. It’s very tempting to spin up a new database that’s perfect for your new use case, for that particular service, but the overhead of learning the ins and outs of your new database is high and the cost of maintaining it is high too. It’s OK to use a database that kind of works. It’s maybe not the cheapest, not the most optimal, but you already have it. Then, if your use cases grow and your production engineering team grows to be able to support 15 different databases, then you decide whether it’s worth spending time on moving to that specialized database.
Don’t Get Possessive
French-Owen: Amazon and Google and Microsoft have all come out with what they call managed databases. Effectively, they run the database for you and they back it up and tune it. They have a lot of control over it in ways that make your life easier. If you’d asked me six or seven years ago, I would’ve said, “Oh, no, you definitely want to run your own database because you have total control over that.” I think in today’s world, that’s actually a bad call. Amazon and Google and Microsoft have gotten so good at running so many databases for hundreds of thousands of companies, you should always sign up to have them run your database.
Kamsky: I’ve been a hands-on, you know, senior backend developer for years and years. And so I go, “But if I’m not running the database server, what if I need to debug something deep inside the guts?”
It turns out that it’s liberating. There are still a lot of people who undervalue the convenience of having somebody else responsible for the physical servers and the upgrades and the replacement of servers that fail and applying security patches. Don’t give up a massive amount of convenience for a possibility that you might need an amount of control that you don’t even really need. You can still get access to all the logs, and you can contact support and be like, “I need you to check X, Y, Z on this machine.” If it is a problem, just tell them, “Provision me a new machine.” Click a button and it’s done. I think that, in a few years, it’s just going to be inconceivable that we even considered hosting data centers ourselves.