Redis is a versatile in-memory data store that’s been around since 2009 and remains very popular for many types of use cases. It’s used by a lot of recognizable names like Twitter, GitHub, Pinterest, Snapchat, Digg, and StackOverflow. In this Redis Jetpack guide, you’ll learn some basics about Redis, use cases and CLI commands, and much more.
What is Redis
Redis (REmote DIctionary Server) is an in-memory, key-value database.
Redis supports different kinds of abstract data structures, such as strings, lists, sets, sorted sets, hashes, bitmaps, HyperLogLogs, streams, and spatial indexes. In order to achieve its outstanding performance, Redis works with an in-memory dataset.
According to DB-Engines, Redis is the #1 key value store and #7 overall most popular database.
Why Use Redis
Redis is a remote data structure server. Yes, you could just store the data in local memory. However, there are a lot of benefits for apps using Redis:
- Redis can be accessed by all the processes of your application, possibly running on several nodes (something local memory cannot achieve).
- Redis memory storage is very efficient and done in a separate process. If your application runs on a platform where memory is garbage-collected (such as Node.js and Java), it allows handling a much bigger memory cache/store.
- Redis can persist the data on disk.
- Redis is much more than a simple cache: it provides various data structures, various item eviction policies, blocking queues, pub/sub, atomicity, Lua scripting, and more.
- Redis can replicate its activity with a master/slave mechanism so that it’s highly available.
- If you need your application to scale on several nodes sharing the same data, Redis is fantastic.
- Redis is more powerful, more popular, and better supported than the closest competitor, Memcached. Memcached can only do a small fraction of the things Redis can do.
- Like a cache, Redis stores key-value pairs. However, unlike a cache, Redis lets you operate on the values.
Here are the data types that can be used in Redis:
- Strings
- Lists
- Sets
- Sorted sets
- Hashes
- Bitmaps
- HyperLogLogs
- Streams
- Spatial indexes
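To make the “operate on the values” point concrete, here is a short illustrative redis-cli session (the key names are made up; the replies shown are the standard ones for these commands):

```
redis-cli> set pageviews 10
OK
redis-cli> incrby pageviews 5
(integer) 15
redis-cli> lpush recent_users alice bob
(integer) 2
redis-cli> lrange recent_users 0 -1
1) "bob"
2) "alice"
```

A plain cache could only get and set the blob stored at `pageviews`; Redis increments it server-side, atomically, without a read-modify-write round trip.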
Top 5 Use Cases for Redis
While some use cases are more common than others, keep in mind that Redis doesn’t have to work alone. You can choose to leverage its speed and flexibility in conjunction with other technologies to help build bridges and even new platforms, as well as to streamline communications between services.
Session Cache
One of the most apparent use cases for Redis is using it as a session cache. The advantage of using Redis over other session stores, such as Memcached, is that Redis offers persistence. While maintaining a cache isn’t typically mission critical with regards to consistency, most users don’t enjoy it when all of their cart sessions go away.
Luckily, with the steam Redis has picked up over the years, it’s pretty easy to find documentation on how to use Redis appropriately for session caching. Even the well-known e-commerce platform Magento has a plug-in for Redis!
Full Page Cache (FPC)
Outside of your basic session tokens, Redis provides a very easy FPC platform. It’s all about consistency. Even across restarts of Redis instances, with disk persistence your users won’t see a decrease in speed for their page loads—a drastic change from something like PHP native FPC. For sites concerned with high traffic, speed for logged-in users, or dynamic page loads, a high-speed and persistent object cache is a must.
For WordPress users out there, Pantheon has an awesome plugin named wp-redis to help you achieve the fastest page loads you’ve ever seen!
Queues
Taking advantage of Redis’ in-memory storage engine to do list and set operations makes it an amazing platform to use as a message queue. Interacting with Redis as a queue should feel native to anyone used to using push/pop operations with lists in programming languages such as Python.
If you do a quick Google search on “Redis queues,” you’ll soon see that there are tons of open source projects out there aimed at making Redis an awesome backend utility for all your queuing needs. For example, Celery has a backend using Redis as a broker.
Leaderboards and Counting
Since it’s in-memory, Redis does an amazing job at increments and decrements. Sets and sorted sets also make our lives easier when trying to do these kinds of operations, and Redis offers both of these data structures. To pull the top 10 users from a sorted set (we’ll call it “user_scores”), one can simply run the following:
ZRANGE user_scores 0 9
Note that ranges are inclusive, so 0 through 9 returns ten entries. This assumes you’re ranking users on an incremental score with the top entries at the low end of the set (use ZREVRANGE if you want highest scores first). If you want to return both the users and their scores, you could run something like this:
ZRANGE user_scores 0 9 WITHSCORES
Pub/Sub
Last, but certainly not least, is Redis’s Pub/Sub feature. Pub/Sub is shorthand for the Publish/Subscribe model of distributed systems. The use cases for Pub/Sub are truly boundless. We’ve seen people use it for social network connections, triggering scripts based on Pub/Sub events, and even a chat system.
Of all the features Redis provides, this one always gets the least amount of love, even though it has so much to offer its users.
In this model, there is a central broker of information. There are information providers who publish the information to the broker. Clients then subscribe to the broker to get updated data.
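As a quick illustration of the model, here is what a minimal redis-cli session might look like, with a subscriber in one terminal and a publisher in another (the channel name is made up):

```
# terminal 1: a client subscribes to a channel via the broker
redis-cli> subscribe news
1) "subscribe"
2) "news"
3) (integer) 1

# terminal 2: a provider publishes to the channel
redis-cli> publish news "hello subscribers"
(integer) 1

# terminal 1 then receives:
1) "message"
2) "news"
3) "hello subscribers"
```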
Top 10 Redis CLI Commands
Redis uses a very straightforward command line interface. Though it’s relatively simple, it does provide some interesting features that one might not expect.
Let’s go over some of the basics and work our way around most of the client’s functionality and features.
To start, we have a simple connection:
> redis-cli -h 127.0.0.1 -p 6379 -a mysupersecretpassword
Alright! We’ve connected to our very own Redis server and authenticated.
Alternatively, you can omit the -a option and authenticate after you connect:
> redis-cli -h 127.0.0.1 -p 6379
127.0.0.1:6379> AUTH mysupersecretpassword
If you have your Redis server and client running on the same machine, you might choose to connect via a Unix socket.
Note: If you provide a hostname and port as well as a socket, redis-cli will still connect via the Unix socket.
> redis-cli -s /tmp/redis.sock
redis /tmp/redis.sock> AUTH mysupersecretpassword
Now that we understand how to connect and authenticate to our Redis instance via the command line, let’s see some examples of useful things we can do with it.
Let’s say you want to execute a command via the command line and only want its output to be returned to standard out:
> redis-cli -h 127.0.0.1 -p 6379 -a mysupersecretpassword PING
Or, perhaps you’d like to execute the same command n number of times:
> redis-cli -h 127.0.0.1 -p 6379 -a mysupersecretpassword -r 4 PING
Notice that we added a -r to our command to supply the “repeat” option. Alternatively, we can add a delay using -i in conjunction with -r.
> redis-cli -h 127.0.0.1 -p 6379 -a mysupersecretpassword -i 1 -r 4 PING
This adds a one-second sleep between each PING command. You can also supply subsecond intervals to this option by using a float:
> redis-cli -h 127.0.0.1 -p 6379 -a mysupersecretpassword -i 0.1 -r 4 PING
This would run the PING command every 10th of a second.
To generate some simple diagnostic information about the Redis instance you are connected to, simply run redis-cli with the --stat option.
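The output looks something like the following (the values shown here are illustrative):

```
> redis-cli -h 127.0.0.1 -p 6379 -a mysupersecretpassword --stat
------- data ------ --------------------- load -------------------- - child -
keys       mem      clients blocked requests            connections
4          1.05M    51      0       7656 (+0)           71
4          1.05M    51      0       7657 (+1)           71
```

A new line is printed at a regular interval so you can watch the values change over time.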
Here we can see:
- How many keys are set on the server.
- The server’s total memory usage.
- The total number of clients connected or blocked.
- The total number of requests the server has served.
- The total current number of connections.
This command is useful to get an overview of the Redis server as a whole. Think of it like running stat on a file.
Now that you know how to generate some simple stats about a Redis server, let’s check the latency of incoming Redis commands.
This is very simple and can be done via the command line with the --latency option:
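For example (the values shown are illustrative):

```
> redis-cli -h 127.0.0.1 -p 6379 -a mysupersecretpassword --latency
min: 0, max: 1, avg: 0.19 (427 samples)
```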
Here we see the minimum, maximum, and average request time, as well as the number of samples taken.
Note: These are recorded in microseconds. For more info about Redis latency, take a look at the documentation for latency monitoring.
To analyze your keyspace in search of large strings or other large data structures, run redis-cli with the --bigkeys option. This is good to use to find large keys in your keyspace, as well as to get a count of the overall distribution of key types.
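The output looks roughly like the following (the key names and sizes here are illustrative):

```
> redis-cli -h 127.0.0.1 -p 6379 -a mysupersecretpassword --bigkeys

# Scanning the entire keyspace to find biggest keys as well as
# average sizes per key type.  You can use -i 0.1 to sleep 0.1 sec
# per 100 SCAN commands (not usually needed).

[00.00%] Biggest string found so far 'session:4122' with 523 bytes
[00.00%] Biggest set    found so far 'series' with 5 members

-------- summary -------

Sampled 7 keys in the keyspace!
Total key length in bytes is 103 (avg len 14.71)

Biggest string found 'session:4122' has 523 bytes
Biggest    set found 'series' has 5 members
```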
This gives us a lot of useful information back about different keys, including their types and sizes.
Overall, the Redis CLI is a powerful tool to help you manage your Redis instance. The ability to use its built-in options can really help in analyzing a problematic Redis server.
Key Operations and Common Uses for Redis
SET
redis-cli> set test key1
OK
EXISTS (Boolean Response)
redis-cli> exists test
(integer) 1
GET
redis-cli> get test
"key1"
SEARCH FOR A THING
redis-cli> sscan series 0 match *allthe*
1) "0"
2) 1) "allthethings"
DELETE
redis-cli> del test
(integer) 1
redis-cli> get test
(nil)
So while there are some examples of the SMEMBERS command for search, below is an example using SSCAN, which is a cursor-based iterator rather than a command that blocks while returning everything at once.
redis-cli> sadd series anything
(integer) 1
redis-cli> sadd series everything
(integer) 1
redis-cli> sadd series allthethings
(integer) 1
redis-cli> sadd series somethings
(integer) 1
redis-cli> sadd series onethings
(integer) 1
redis-cli> sscan series 0
1) "0"
2) 1) "anything"
   2) "everything"
   3) "allthethings"
   4) "somethings"
   5) "onethings"
The "0" cursor in the reply means the iteration is complete; since sets are unordered, the element order may vary.
Redis Security
While Redis is designed to be accessed by trusted clients, you can implement security at multiple levels:
- Implement security around the application
- SSL-encrypt the traffic you’re passing between the Redis client and server
- Set requirepass in redis.conf or via CONFIG SET
- Rename the CONFIG command so that it’s not available
Depending on the needs of your application, you can choose to leverage all of these elements together, or on an individual basis.
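For example, the last two items might look like this in redis.conf (the password is a placeholder):

```
# require clients to AUTH before running commands
requirepass mysupersecretpassword

# rename CONFIG to an unguessable name, or to "" to disable it entirely
rename-command CONFIG ""
```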
Python Cache: One extremely useful script that I have used in several applications is redis-simple-cache. I recommend this code because of its simplicity, ease of use, and consistent performance, if you choose not to use the de facto Werkzeug implementation. And that’s it. Awesome, right?
PHP: While predis is widely seen as the Redis driver of choice, it can occasionally have scalability issues streaming at high volumes. In this case, take the time to install the C-based driver, phpredis.
Others: Many other great options are generously provided on Redis.io.
There are many useful real-world examples of Redis being used to improve system intercommunication and to build platform scalability. However, as the technology continues to evolve, it’s important to keep a pulse on how its capabilities can benefit your business.
Reliable PUBSUB and Blocking List Operations
The standard redundant Redis solution is to run master/slave replication with Sentinel managing the failover. This is expected to be followed up with either client support and use of Sentinel to discover the current master, or a TCP proxy in front of the Redis pod which is managed by Sentinel to point to the master.
Redis Sentinel was designed with the expectation that clients use Sentinel directly. However, using a TCP proxy is a growing trend and the way ObjectRocket Redis is configured.
The Effects of Failover on Commands in Redis
For the vast majority of Redis operations, this works as expected. On a failover, the client code is expected to detect the connection has failed and either look up the current master and reconnect or attempt a reconnect based on the scenarios mentioned above. The proxy-based scenario is also identical to a Redis server restart so this is also applicable to a case where you have a single instance Redis which can be restarted.
However, a small handful of Redis commands are long-term blocking or non-atomic. The PUBSUB SUBSCRIBE and PSUBSCRIBE commands will register the request to have any messages sent to the given channel(s) sent to the client as well, but the actual data can arrive much later (after it is published). The BL* commands are long-term, two-phase commands; like the subscribe commands, they will block and not return until an item is available in the list to be removed.
A key commonality between these commands is that they are issued and a form of registration is done on the server, but the data can come later. In between the command being issued and the server having data to reply with, failovers and restarts can occur. In the case of a restart or a failover, the “registration” is no longer on the server. This brings us to the question of what happens regarding these commands in these scenarios.
Two Types of Failovers
Another aspect to bring up (as it will come into play during the testing and has dramatic effects on results and how to code for the failure scenarios) is that there are two types of failovers.
Failover 1: The failover most of us consider is the loss of the master. In this type of failover, the master is non-responsive. I’ll refer to this as a triggered failover.
Failover 2: I’ll refer to the second class of failovers as the initiated failover. The initiated failover takes place when an administrator sends the failover command to a Sentinel, which then reconfigures the pod, promoting a slave to master and then demoting the original master.
The Key Difference: On the surface these two scenarios appear to be the same. In the case of the vast majority of Redis commands, they can be treated identically. However, in the case of our long-block commands they must be understood in their own context.
The key functional differentiator between these two failover classes is the sequence of events. A triggered failover occurs because the master is not responding. Because of this, no data is being added or modified and no messages are published on the master. In an initiated failover, there is a very brief window where there are two masters.
The techniques and scenarios discussed here will be described in the general and demonstrated using the Python Redis library “redis-py”, but apply to any client library.
Now that you’ve got the background information, we can look at how this affects the use of the long-block commands. First, let’s look at PUBSUB.
Redis PUBSUB consists of three commands: PUBLISH, SUBSCRIBE, and PSUBSCRIBE. The PUBLISH command is used to send a message. SUBSCRIBE and PSUBSCRIBE are used to register for and receive messages published via the PUBLISH command.
Let’s examine what happens in a PUBLISH command.
When you execute PUBLISH, it will publish the message to the given channel immediately and return the number of clients which received it. How does a failure scenario make this relevant? Consider the case of failover; the client will have to reconnect as the TCP connection goes away. If talking to a proxy, it will do a direct reconnect and get a different server. If doing Sentinel discovery, it will take a bit longer and still connect to a different server.
The issue is in timing. If the publisher connects to the new server before any subscribers do, the message is gone. Lost. Can that happen? To understand that, we look to the SUBSCRIBE commands.
When your code issues a subscribe command, the server registers in its in-memory data structures that the specific connection should get messages on the channel(s) subscribed to. This information is not propagated to slaves. When a failure happens and the client reconnects, it must re-issue the subscribe command. Otherwise, the new server does not know that particular connection needs to receive certain messages. Now we have the clear potential for a subscriber to reconnect to the new master after the publisher. The conditions for loss of messages now exist.
How to Minimize the Risk
There are a few options.
Option 1: First, the publisher can leverage the fact that PUBLISH returns the number of clients which received it. If zero clients receive it, the publisher can retry. For systems where there is a static number of subscribers, or where “at least one subscriber” is sufficient, this prevents message loss. For systems which do not meet these criteria, there is another, if less robust, option.
Option 2: The second option is to control the reconnect windows. The rule of thumb here is to have the publisher delay be at least three times as long as the subscriber delay. This will provide at least three opportunities to have the subscribers reconnect prior to publishing messages. So long as the subscribers are online first, the messages will not go into the ether. While this is more robust than not controlling it, there is still the possibility of message loss.
Option 3: The third option to mitigate this race condition is to build in a lock mechanism. This is certainly the most complex route because it requires building or employing a mechanism which prevents the publishers from publishing until the clients connect. The basic process is to have some form of shared location (such as via Zookeeper, a database, Consul, etc.) where the subscribers register that they are ready to receive, and the publishers check for valid and ready subscribers before publishing messages. To further add complexity, the subscribers will have to “unset” themselves in this mechanism when they lose the TCP connection.
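The lock mechanism can be sketched as follows. This is a minimal illustration under assumed names (`SubscriberRegistry`, `publish_when_ready`): the registry here is an in-memory stand-in for whatever shared location you actually use (Zookeeper, a database, Consul, etc.).

```python
import time

class SubscriberRegistry:
    """Stand-in for a shared readiness registry. In production this state
    would live in Zookeeper, Consul, a database, or similar, not in-process."""
    def __init__(self):
        self._ready = set()

    def register(self, subscriber_id):
        # Subscriber calls this once its SUBSCRIBE has been re-issued.
        self._ready.add(subscriber_id)

    def unregister(self, subscriber_id):
        # Subscriber must "unset" itself when it loses its TCP connection.
        self._ready.discard(subscriber_id)

    def ready_count(self):
        return len(self._ready)

def publish_when_ready(registry, publish_fn, message,
                       min_subscribers=1, poll_interval=0.05, timeout=5.0):
    """Block until enough subscribers report ready, then publish."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if registry.ready_count() >= min_subscribers:
            return publish_fn(message)
        time.sleep(poll_interval)
    raise TimeoutError("no ready subscribers before timeout")
```

The publisher side calls `publish_when_ready(registry, r.publish_wrapper, msg)` instead of publishing directly, trading latency for the guarantee that a registered subscriber exists at publish time.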
It needs to be called out that this is all for a triggered failover. It will work because the master is non-responsive so no messages go to it.
In the initiated failover scenario, the master will continue accepting messages until it is no longer the master or is disconnected. None of the above methods will properly and fully catch this scenario.
In the case of system maintenance (an initiated failover), your best bet is to employ both the first and second options, and accept that there can be message loss. How much message loss is entirely dependent on your publishing schedule. If you publish a message on average every few minutes, there is a minuscule chance you’ll actually encounter the loss window. The larger the publishing interval, the smaller the risk. Conversely, if you are publishing hundreds or thousands of messages per second you will certainly lose some. For the managed proxy scenario, it will be dependent on how quickly the failover completes in your proxy. For ObjectRocket’s Redis platform, this window is up to 1.5 seconds.
One approach, which fails at the limits, is to have the subscribers try to reconnect at an interval of one-third of your publishing interval. For example, publish a message once per minute and code/configure your subscribers to reconnect at least as often as every 20 seconds. As the publishing interval approaches the three-second mark, this becomes harder to achieve, as a failover reconnect process (lookups, TCP handshake, AUTH, SUBSCRIBE, etc.) can easily total a few seconds.
BLPOP and Friends
How do blocking list commands operate under these conditions? The news here is better than for publishing. In this case, we are modifying data structures which are replicated. Furthermore, because these are modifications to data structures, we don’t have the same level of risk of lost messages due to order of reconnect. If the producer reconnects prior to a consumer, there is no change to expected behavior. The producer will PUSH the item onto the list (or POP and PUSH), and once the consumer connects, the data will be there. With one exception.
In the case of an initiated failover, we must consider the brief multiple-master problem. For a few moments (on the order of milliseconds) during an initiated failover, the original master will still accept data modifications. Because a slave has already been promoted to master, these modifications will not replicate. This window is very small, but it is there. As with publishing, this is essentially a corner condition: the odds of, for example, issuing a POP command against the old master are low. It becomes a conceivable condition when your rate of producers and consumers making list modification commands reaches thousands per second.
In the case of a worker doing BRPOPLPUSH, for example, the item being moved will be “back” in its original position after the failover. In the case of a BLPOP, the result will be essentially the same: after the failover completes, the item will appear to be re-queued. If your items are idempotent jobs, this will not be an issue. How much defensive coding you should do to account for this in the case of non-idempotent jobs comes down to determining the effect of a job being run or an item processed twice (factoring in how frequently you are making modifications) and thus the likelihood of encountering this situation. Since this only happens in the event of an initiated failover, it should be under operational control, and it’s advisable that whenever possible maintenance should be done at a time when your systems are at their lowest usage to minimize or even eliminate this possibility.
For the triggered failover, there is no expectation of data loss. A triggered failover happens when the master is unresponsive, so no modifications are being made to the master. However, as with each of these scenarios, there is the matter of handling the actual mandatory TCP reconnect in any failover scenario.
Under either failover scenario, triggered or initiated, the client will have to detect the failure and reconnect, either after a Sentinel lookup or after a brief window to the same address. In the case of ObjectRocket Redis, which utilizes a managed proxy layer, the client will simply reconnect. The actual failover process itself can take up to two seconds to complete, so the client code should account for this. Ideally, an immediate retry should take place to handle network blips where a connection is simply dropped somewhere along the route. However, this should be accompanied by a backoff algorithm with retries to account for cases like server restarts (as happens in either standalone Redis restarts or a proxy restart).
Upon reconnect, any subscribers need to re-subscribe since the link between the requested channels and the new TCP connection must be established. Redis’ PUBSUB channels are created when either a subscriber or publisher attempts to access them, so there is no need for “recreating” the channel.
What’s the Best Way to Handle Reconnections?
The answer is highly dependent on the client library in use and how it handles disconnects. In an ideal scenario, the connection would be configurable for automatic retry with an ultimate failure returned if boundaries are met. Testing results vary dramatically between libraries, so let’s discuss a rather common one: redis-py.
Redis-py appears to retry when the connection drops. Unfortunately, it immediately retries, and thus is too quick to reliably recover the connection during a failover. There appears to be no way to configure this. As a result, your code must catch/detect the failed reconnect attempt and manage the reconnect yourself.
First, let us examine some standard redis-py publish and subscriber code blocks:
This is fairly straightforward. However, when the immediate retry during the subscriber’s for-loop fails, you will get a redis.ConnectionError exception thrown. The tricky bit is it happens “inside” the for item in p.listen() line. To properly catch it, you must wrap the entire for statement in a try/except block. This is problematic at best and leads to unnecessary code complexity.
An alternative route is to do the following instead:
With this method, we call get_message() directly which allows us to catch the exception at that point and re-establish the ‘p’ object’s connection. The amount of time to sleep, if at all, is dependent on your code’s requirements. Of course, if your subscriber expects to handle a certain number of messages and a for-loop works better, it will still work. For publishers, the code is simpler as it is generally not run on an iterator:
With this method, you control whether, when, and how often to retry. This is critical to transparently handling the failure event properly. The same mechanism is necessary if the code is using Sentinel directly, or even if the Redis server in question restarted. Really, all publishing code should follow this basic mechanism.
For the blocking list commands, such as BLPOP, the order of client reconnection isn’t strictly important as the data is persisted. The try/except method described above is necessary to re-establish the connection when the command execution results in a redis.ConnectionError exception being thrown.
Using these techniques with a 3-second retry window for redis-py on the ObjectRocket platform ensures that there is no data loss for triggered failovers. There’s approximately 1.5 seconds of potential message loss for non-stop publishers during an initiated failover.
While the code examples are specific to redis-py, the basic techniques should be used with any code that needs to handle server reconnects and makes use of either the PUBSUB commands or blocking list operations.
Armed with this knowledge, you can now implement a highly available Redis pod and know your application can handle failovers without waking your ops team up in the middle of the night.
Source: Team ObjectRocket Redis Jetpack