You can’t spell data science without data. Okay, that’s cheesy but it’s true! Most (if not all) of the time, the data you need is stored in a DBMS (database management system) on a remote server or your hard drive.
This means you need to communicate with this DBMS to both store and retrieve data. To do that, you need to speak its language: SQL (Structured Query Language). (Note: Over the years, people have begun referring to databases themselves as SQLs.)
Recently, another term surfaced: NoSQL databases. Whether you’re just starting with data science or have been in the field for a while, you probably have heard of both SQL and NoSQL databases.
Whether to use SQL or NoSQL databases depends on your data and target application. But let’s say you’re using Python and you already know which type of database you’re going to use. The question now is: which Python library do you use?
In this article I’ll cover the most well known, used and developed Python database libraries. We’ll talk about each library itself and the best reasons to use each one.
Python SQL Libraries
We use SQL libraries with relational databases (RDBMS). Relational databases store data in different tables and each table contains multiple records. These tables are connected using one or more relations.
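To make the idea of tables connected by relations concrete, here is a minimal sketch using Python’s built-in sqlite3 module (covered next). The authors and books tables and their contents are made up for illustration; the point is only that a row in one table refers to a row in another, and a JOIN follows that relation.

```python
import sqlite3

# Throwaway in-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two hypothetical tables: each book row refers to an author row by id.
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE books ("
    "id INTEGER PRIMARY KEY, title TEXT, "
    "author_id INTEGER REFERENCES authors(id))"
)

conn.execute("INSERT INTO authors (id, name) VALUES (1, 'Ada')")
conn.executemany(
    "INSERT INTO books (title, author_id) VALUES (?, ?)",
    [("Notes", 1), ("Letters", 1)],
)

# The JOIN follows the relation from books back to authors.
rows = conn.execute(
    "SELECT authors.name, books.title "
    "FROM books JOIN authors ON books.author_id = authors.id "
    "ORDER BY books.title"
).fetchall()
conn.close()

print(rows)  # [('Ada', 'Letters'), ('Ada', 'Notes')]
```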
SQLite is a C-language library that implements a small, fast, self-contained, serverless and reliable SQL database engine. Support for SQLite ships with Python’s standard library, which means you don’t need to install anything. You can use it right away. In Python, this database communication library is called sqlite3.
Use SQLite when...
you’re a beginner just starting to learn about databases and how to interact with them.
you’re building embedded applications. If your application requires portability, go with SQLite because it has a small footprint and is very lightweight.
your data is stored in a file on your hard drive. You can also use SQLite as a local stand-in for a client/server RDBMS for testing purposes.
you need a fast connection to your data. You don’t need to connect to a server to use SQLite, which also means the library has low latency.
SQLite is not the best option if concurrency is a big concern for your application because write operations are serialized. Moreover, SQLite is weak when it comes to multi-user applications.
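Here is a small sketch of what working with sqlite3 can look like. The scores table, its rows and the top_scores function are hypothetical; the default path ":memory:" creates a temporary database, and you would pass a file name instead to keep the data on disk.

```python
import sqlite3


def top_scores(db_path=":memory:", limit=2):
    """Create a hypothetical scores table, fill it, and return the best rows."""
    conn = sqlite3.connect(db_path)  # no server involved, just a file or memory
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "CREATE TABLE IF NOT EXISTS scores (player TEXT, points INTEGER)"
            )
            conn.executemany(
                "INSERT INTO scores VALUES (?, ?)",
                [("ann", 30), ("bob", 45), ("cy", 12)],
            )
        # "?" placeholders keep values out of the SQL string itself.
        cur = conn.execute(
            "SELECT player, points FROM scores ORDER BY points DESC LIMIT ?",
            (limit,),
        )
        return cur.fetchall()
    finally:
        conn.close()


print(top_scores())  # [('bob', 45), ('ann', 30)]
```

Note that calling this repeatedly against the same file would insert the sample rows again each time; it is meant as a one-off demonstration.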
MySQL is one of the most widely used and well-known open-source RDBMSs. It uses a client/server architecture built around a multi-threaded SQL server, which lets MySQL take advantage of multiple CPUs. MySQL is written in C and C++ and runs on a wide range of platforms. Its key features are scalability, security and replication.
To use MySQL, you need to install its connector. In the command line, you can do that by running:
python -m pip install mysql-connector-python
Use MySQL when...
you need extra security. Thanks to MySQL’s security features, it’s a good fit for applications requiring user or password authentication.
you need multi-user support. Unlike SQLite, MySQL supports multi-user applications and is a good choice for distributed systems.
you want advanced backup and interaction capabilities, but with simple syntax and no-hassle installation.
MySQL, however, performs poorly when you execute bulk INSERT operations or need full-text search.
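Once the connector is installed, talking to MySQL from Python might look roughly like the sketch below. The host, user, password, database name and the customers table are all placeholders I made up, and the import sits inside connect so the rest of the code can be read (and tried with a stand-in connection) without a running server.

```python
def connect(host="localhost", user="app_user", password="change-me",
            database="shop"):
    """Open a MySQL connection. Every default here is a placeholder."""
    import mysql.connector  # provided by the mysql-connector-python package

    return mysql.connector.connect(
        host=host, user=user, password=password, database=database
    )


def customers_in_city(conn, city):
    """Return (id, name) rows from a hypothetical customers table."""
    cursor = conn.cursor()
    try:
        # %s placeholders let the driver escape the value for us.
        cursor.execute(
            "SELECT id, name FROM customers WHERE city = %s", (city,)
        )
        return cursor.fetchall()
    finally:
        cursor.close()


# Typical use, assuming a server is reachable with these credentials:
#   conn = connect()
#   print(customers_in_city(conn, "Oslo"))
#   conn.close()
```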
PostgreSQL is another open-source RDBMS that focuses on extensibility and uses a client/server architecture. In PostgreSQL, the server process that manages the database files and operations is itself called the “postgres” process.
To communicate with a PostgreSQL database, you need to install a driver that enables Python to do that. One commonly used driver is psycopg2. You can install it by running the following command-line instruction:
pip install psycopg2
Use PostgreSQL when...
you’re running analytical applications or data warehousing. PostgreSQL has outstanding parallel processing capabilities.
you need your database to adhere to the ACID (A: atomicity; C: consistency; I: isolation; D: durability) model (mostly financial applications). In this case, PostgreSQL provides an optimal platform to do so.
you need a database for research or scientific projects.
PostgreSQL is a bit more complex to install and get started with than MySQL. That said, it’s worth the hassle considering the countless advanced features it provides.
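To show what the “A” in ACID buys you, here is a sketch of a money transfer with psycopg2. The accounts table, the connection settings and the function names are hypothetical. With psycopg2, using the connection as a context manager wraps the block in a transaction: both updates are committed together, or neither is applied if something raises.

```python
def connect(dbname="bank", user="app_user", password="change-me",
            host="localhost"):
    """Open a PostgreSQL connection. Every default here is a placeholder."""
    import psycopg2  # installed with: pip install psycopg2

    return psycopg2.connect(
        dbname=dbname, user=user, password=password, host=host
    )


def transfer(conn, from_id, to_id, amount):
    """Move `amount` between two rows of a hypothetical accounts table."""
    with conn:  # commit on success, roll back on any exception
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE accounts SET balance = balance - %s WHERE id = %s",
                (amount, from_id),
            )
            cur.execute(
                "UPDATE accounts SET balance = balance + %s WHERE id = %s",
                (amount, to_id),
            )


# Typical use, assuming a reachable server and an existing accounts table:
#   conn = connect()
#   transfer(conn, from_id=1, to_id=2, amount=50)
#   conn.close()  # "with conn" ends the transaction but keeps the connection open
```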
Python NoSQL Libraries
NoSQL databases are more flexible than relational databases. In these types of databases, the data storage structure is designed and optimized for specific requirements. There are four main types of NoSQL databases: document-oriented, key-value, column-oriented and graph. The libraries below cover one of each type.
MongoDB is a well-known data store among modern developers. It’s an open-source, document-oriented data storage system. We commonly use PyMongo to interact with one or more MongoDB instances from Python code. MongoEngine is a Python ORM written for MongoDB on top of PyMongo.
To use MongoDB, you need to install an engine and the actual MongoDB libraries.
pip install pymongo==3.4.0
pip install mongodb
Use MongoDB when...
you want to build easy-to-scale applications you can easily deploy.
your data is document-structured but you want to harness the power of relational database functionalities.
you have an application with variable data structures such as IoT applications.
you’re working with real-time applications such as e-commerce applications and content management systems.
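The “variable data structures” point is easiest to see in code. In this sketch, two hypothetical IoT devices report different fields and both documents go into the same collection without any schema change. The connection URI, the iot database, the readings collection and the helper names are my own placeholders.

```python
def get_collection(uri="mongodb://localhost:27017", db="iot", name="readings"):
    """Return a PyMongo collection. Every default here is a placeholder."""
    from pymongo import MongoClient  # provided by the pymongo package

    return MongoClient(uri)[db][name]


def make_reading(device_id, **fields):
    """Build one document; each device can report whatever fields it has."""
    return {"device_id": device_id, **fields}


def store_readings(collection, readings):
    """Insert a list of documents and return how many were stored."""
    result = collection.insert_many(readings)
    return len(result.inserted_ids)


readings = [
    make_reading("thermo-1", temp_c=21.5),
    make_reading("door-7", is_open=True, battery=0.82),  # different shape
]

# Typical use, assuming a MongoDB server is running locally:
#   store_readings(get_collection(), readings)
```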
Redis is an open-source, in-memory data structure store. It supports data structures such as strings, hashes, lists, sets and many more. Redis provides high availability via Redis Sentinel and automatic partitioning with Redis Cluster. Because it keeps data in memory, Redis is also often cited as one of the fastest data stores available.
You can set up Redis by executing the following instructions from the command line:
wget http://download.redis.io/releases/redis-6.0.8.tar.gz
tar xzf redis-6.0.8.tar.gz
cd redis-6.0.8
make
Use Redis when...
speed is a priority in your applications.
you have a well-planned design. Redis has many defined data structures and gives you the chance to define explicitly how you want your data to be stored.
your database has a stable size. Redis can increase lookup speed for specific information in your data.
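A common way to use Redis for faster lookups is to put it in front of a slower data source as a cache. The sketch below assumes the redis-py client (pip install redis), which the steps above don’t install, and a server on localhost; cached_lookup, the key names and the 60-second expiry are illustrative choices of mine.

```python
def get_client(host="localhost", port=6379):
    """Return a Redis client. Host and port are placeholders."""
    import redis  # assumption: the redis-py package is installed

    # decode_responses=True makes get() return str instead of bytes.
    return redis.Redis(host=host, port=port, decode_responses=True)


def cached_lookup(client, key, compute, ttl_seconds=60):
    """Return the cached value for `key`, computing and storing it on a miss."""
    value = client.get(key)
    if value is not None:
        return value  # cache hit: the slow path is skipped entirely
    value = str(compute())  # cache miss: do the slow work once
    client.set(key, value, ex=ttl_seconds)  # ex = expiry in seconds
    return value


# Typical use, assuming a Redis server is running locally:
#   client = get_client()
#   total = cached_lookup(client, "report:total", lambda: sum(range(10**7)))
```

Values come back as strings here, so the caller converts them if it needs numbers.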
Apache Cassandra is a column-oriented NoSQL data store designed for write-heavy storage applications. Cassandra provides scalability and high availability without compromising performance, and it offers low latency for multi-user applications.
Cassandra is a bit complex to install and get started with. However, you can do so by following the installation guide on the official Cassandra website.
Use Cassandra when...
you have huge amounts of data. Cassandra has great flexibility and power to deal with incredible amounts of data, so most big data applications are a good use case for Cassandra.
you need reliability. Cassandra provides stable real-time performance for streaming and online-learning applications.
security is a priority. Cassandra has powerful security management, which makes it a great fit for fraud detection applications.
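For a feel of the write-heavy use case, here is a sketch that records a batch of sensor events. It assumes the DataStax cassandra-driver package (pip install cassandra-driver), a node on 127.0.0.1, and a hypothetical metrics keyspace with an events table; none of these come from the Cassandra installation guide itself.

```python
def connect(hosts=("127.0.0.1",), keyspace="metrics"):
    """Return (cluster, session). Hosts and keyspace are placeholders."""
    from cassandra.cluster import Cluster  # assumption: cassandra-driver

    cluster = Cluster(list(hosts))
    return cluster, cluster.connect(keyspace)


# Hypothetical table: events(sensor_id, ts, value).
INSERT_EVENT = "INSERT INTO events (sensor_id, ts, value) VALUES (%s, %s, %s)"


def record_events(session, events):
    """Write each (sensor_id, ts, value) tuple and return how many were sent."""
    for sensor_id, ts, value in events:
        session.execute(INSERT_EVENT, (sensor_id, ts, value))
    return len(events)


# Typical use, assuming the keyspace and table already exist:
#   from datetime import datetime, timezone
#   cluster, session = connect()
#   record_events(session, [("s1", datetime.now(timezone.utc), 0.5)])
#   cluster.shutdown()
```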
Neo4j is a NoSQL graph database built from the ground up to leverage data and data relationships. Neo4j connects data as it’s stored, enabling queries at high speed. Neo4j was originally implemented in Java and Scala and then extended for use on different platforms, such as Python.
Neo4j is essentially a graph database library and has one of the best websites and technical documentation systems out there. It’s clear, concise and covers all questions you may have about installing, getting started with and using the library.
Use Neo4j when...
you need to visualize and analyze networks and their performances.
you’re designing and analyzing recommendation systems.
you’re analyzing social media connections and extracting information based on existing relations.
you’re going to perform identity and access management operations.
you need to perform various supply chain optimizations.
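As a taste of the recommendation and social-network use cases, this sketch suggests “friends of friends” for a person. It assumes the official neo4j Python driver (pip install neo4j), a local server reachable over Bolt, and a hypothetical graph of Person nodes joined by FRIEND relationships; the credentials and function names are placeholders.

```python
def get_driver(uri="bolt://localhost:7687", user="neo4j",
               password="change-me"):
    """Return a Neo4j driver. URI and credentials are placeholders."""
    from neo4j import GraphDatabase  # assumption: the neo4j package

    return GraphDatabase.driver(uri, auth=(user, password))


# Cypher: walk two FRIEND hops out from the person, then drop the person
# themselves and anyone they are already directly connected to.
FRIENDS_OF_FRIENDS = (
    "MATCH (p:Person {name: $name})-[:FRIEND]->(:Person)-[:FRIEND]->(fof:Person) "
    "WHERE fof <> p AND NOT (p)-[:FRIEND]->(fof) "
    "RETURN DISTINCT fof.name AS name ORDER BY name"
)


def suggest_friends(driver, name):
    """Return names two hops away from `name` in the hypothetical graph."""
    with driver.session() as session:
        result = session.run(FRIENDS_OF_FRIENDS, {"name": name})
        return [record["name"] for record in result]


# Typical use, assuming a running server with Person/FRIEND data:
#   driver = get_driver()
#   print(suggest_friends(driver, "Ann"))
#   driver.close()
```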
Choosing the correct database for your data structure and application can decrease your application’s development time while increasing the efficiency of your work. Developing the ability to choose the correct database type on the fly may take a little time, but once you do, most of the tedious work on your project will be much simpler and faster. The best way to develop any skill is to practice, and a big part of that practice is trial and error (usually my method). Try different options until you find the one that resonates with you and fits your application.