Monday, May 21, 2012

NoSQL vs SQL

I've been exposed to NoSQL for about a year now, and I think I am finally able to describe what it is to other people.

NoSQL is a marketing term, well-named to catch buzz, and ill-named to describe what it is.

It kind of implies that we should throw SQL away and move to this hip database. Before all the NoSQL fuss, the most popular databases were relational databases, which supported a language called SQL. The marketing idea was to say, we're the opposite team, so we're "NO SQL". I think it worked.

The Traits


There are traits that relational databases vs NoSQL databases have. I think viewing them like this gives a better understanding of the differences.

How to Represent Data


At the end, a database is a system where you give it a question and it returns results. If you think of it as a map, you give it a key, and it returns a value.

A relational database's key is the primary key of the tables you specify.

For NoSQL databases, the key can be anything you want (including something like primary keys).

What people choose to use as a key depends on their design decisions. For example, Google's Bigtable specifies the key to be the row name, column name, and timestamp.

And you can see that nobody's stopping you from using relational databases to have a table whose primary keys are row name, column name and timestamp.

The API


All relational databases I know of, support SQL and some language binding, so that's how you talk to the database.

For NoSQL databases, the API can be anything, but usually is some language binding or web service.

Note that nothing can stop you from supporting SQL in NoSQL databases. It just usually isn't supported by default by all NoSQL database.

Scalability


Relational databases weren't designed from the beginning to be deployed in a distributed environment. They're usually on one computer as one process.

So if you want to support some data partitioning like,

  • for row 1 to row 100, I want to contact the database on machine 1
  • for row 101 to row 200, I want to contact the database on machine 2
  • although they're two database processes running, we actually view it as one logical database,
then your application has to be smart enough to do it.

But still, that means relational databases *can* scale with the help of application level code.

Many NoSQL databases were designed from the beginning to be distributed. This means, you get the partitioning above for free. This is why people say NoSQL databases scale better. It's because you don't need application code to scale.

Note that I said *many* NoSQL databases are distributed. A bunch of them aren't, but they're still called NoSQL because they have other NoSQL traits.

Schema


Relational databases need a schema and they strictly check that data conforms to that schema.

NoSQL databases might or might not have a schema, depending on the design. If you want a schema but the NoSQL database doesn't support it, then you need application level code to do it.

Transaction Guarantees


Relational databases have the ACID guarantee (you don't need to know what it is), but it means you can have transactions any way you like it. For example, your transaction could be modifying all the rows in the database in one shot. But note that whenever you make a relational database distributed though application code, you lose this guarantee just like NoSQL databases.

Distributed NoSQL databases normally have the BASE guarantee. Again you don't need to know the details what it is, but it means that if you have data across two machines, and you do a transaction on this data, then there might be a time lapse where the data in these two machines are not consistent. (They will eventually, but just not immediately.)

This relaxed guarantee is a tradeoff between performance and consistency. Technically, it's possible to have ACID guarantees for distributed transactions, but it doesn't perform well. On the other hand, you can have a pretty efficient distributed database if you relax this guarantee.

That said, if the transaction only involves data living on one machine in a distributed system, ACID can still be guaranteed. This also means that if the NoSQL database is a non-distributed one, then it can provide ACID guarantees easily, because data is on one machine.

So the guarantee boils down to this. If the data you're performing a transaction against is on the same machine, you can provide ACID. If the data is on different machines, but you need to perform a transaction, then there might be some inconsistencies for a short while if you want reasonable performance.

When to Choose Which


If you're deciding whether to use a relational database or NoSQL database, consider what traits do you want your database to have.

TraitRelational DatabasesNoSQL Databases
represent data with tables and primary keysyesmaybe
represent data with anything elsenoyes
sql supportyesmaybe
scalabilitymaybeyes
schemasyesmaybe
transaction on same machineyesyes
transaction across many machineschoose between relaxed guarantees vs efficiencychoose between relaxed guarantees vs efficiency

Any thing other than "yes" means you might need to write your own application level support.

As you can see with some amount of application level code, relational and NoSQL databases pretty much can support the same things.

The reason I see most people use NoSQL is for the non-strict schema and the scalability, which (1) requires a lot of work for application level code to handle, and (2) the current fashion for software is shipping fast (might need to change schema often) and scalable software.

Other things that might affect your decisions are: community support for the database, your boss says so, price, installation difficulty. These topics are pretty self-explanatory, so I didn't include them in this post.

I hope now you have a good answer when the next person asks you why you're using a NoSQL database.

2 comments:

  1. Hey Amarin, awesome article. This is a must read for anyone wondering about the difference between the two. I've been using Google Datastore, and you're spot on in your comparison. One thing I'd like to add is that SQL does not 100% fit on top of NoSQL. You'll find that if you are wanting to write SQL, then you probably want a relational database. When desigining a NoSQL database, you want to go for data de-normalization. This goes against most of our programmer instincts as we don't like to repeat data. On the other hand, this makes it possible to run insanely fast queries on insanely large databases. This does come at the cost of more complex data management, and longer database writes. Overall I'd say I like "NoSQL" but would not use it because it is cool or hip, only after I analyzed the problem, and had determined scalability, and/or large data sets were a factor in design.

    ReplyDelete
  2. Hey Andy, yep, you're totally right.

    Technically one can probably implement the SQL language on top of a NoSQL database, but that would entail performance implications. I left that part out.

    Thanks for your feedback!

    ReplyDelete