A TCP proxy to simulate network and system conditions for chaos

Toxiproxy

Toxiproxy is a framework for simulating network conditions. It is built specifically for testing, CI and development environments: it supports deterministic tampering with connections, as well as randomized chaos and customization. Toxiproxy is the tool you need to prove with tests that your application doesn't have single points of failure. We've been using it successfully in all development and test environments at Shopify since October 2014. See our blog post on resiliency for more information.

Toxiproxy usage consists of two parts: a TCP proxy written in Go (what this repository contains) and a client that communicates with the proxy over HTTP. You configure your application to route all test connections through Toxiproxy and can then manipulate their health via HTTP. See Usage below for how to set up your project.
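To make the "over HTTP" part concrete, here is a minimal sketch of the kind of requests a client might send. It assumes the JSON request shapes of the 2.x HTTP API and the default API port 8474; older releases may use different endpoints. The helper names (create_proxy_request, latency_toxic_request, send_to_toxiproxy) are hypothetical and only illustrate the idea.

```ruby
require "json"
require "net/http"
require "uri"

# Assumption: the Toxiproxy HTTP API listens on its default port, 8474.
TOXIPROXY_API = URI("http://127.0.0.1:8474")

# Build (but do not send) the request that registers a proxy.
def create_proxy_request(name:, listen:, upstream:)
  payload = { name: name, listen: listen, upstream: upstream, enabled: true }
  request = Net::HTTP::Post.new("/proxies", "Content-Type" => "application/json")
  request.body = JSON.generate(payload)
  request
end

# Build (but do not send) the request that adds a latency toxic.
# Assumption: 2.x API shape, where toxics are posted to /proxies/<name>/toxics.
def latency_toxic_request(proxy_name, latency_ms)
  payload = {
    type: "latency",
    stream: "downstream",
    attributes: { latency: latency_ms }
  }
  request = Net::HTTP::Post.new("/proxies/#{proxy_name}/toxics",
                                "Content-Type" => "application/json")
  request.body = JSON.generate(payload)
  request
end

# Send a prepared request; this needs a running Toxiproxy server.
def send_to_toxiproxy(request)
  Net::HTTP.start(TOXIPROXY_API.host, TOXIPROXY_API.port) do |http|
    http.request(request)
  end
end
```

With a server running, send_to_toxiproxy(latency_toxic_request("mysql_master", 1000)) would ask the proxy to delay downstream traffic. In practice the client libraries wrap these calls for you.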

For example, to add 1000ms of latency to the response of MySQL from the Ruby client:

Toxiproxy[:mysql_master].downstream(:latency, latency: 1000).apply do
  Shop.first # this takes at least 1s
end

To take down all Redis instances:

Toxiproxy[/redis/].down do
  Shop.first # this will throw an exception
end
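Both examples scope the failure to a Ruby block, so the connection is healthy again once the block returns. The following is a rough sketch of that block-scoped pattern, using a hypothetical FakeProxy class. It is not the gem's implementation, only an illustration of why the state is restored even when the block raises.

```ruby
# Hypothetical stand-in for a proxy; not the toxiproxy gem's implementation.
class FakeProxy
  attr_reader :name

  def initialize(name)
    @name = name
    @enabled = true
  end

  def enabled?
    @enabled
  end

  # Disable the proxy while the block runs, then re-enable it.
  # The ensure clause runs even if the block raises.
  def down
    @enabled = false
    yield
  ensure
    @enabled = true
  end
end

proxy = FakeProxy.new("redis")
proxy.down do
  puts proxy.enabled? # false inside the block
end
puts proxy.enabled? # true again after the block
```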

Why yet another chaotic TCP proxy?

The existing ones we found didn't provide the kind of dynamic API we needed for integration and unit testing. Linux tools such as nc are not cross-platform and require root, which makes them problematic in test, development and CI environments.

Example

Let's walk through an example with a Rails application. Note that Toxiproxy is in no way tied to Ruby; it just happens to have been our first use case. You can see the full example at sirupsen/toxiproxy-rails-example. To get started right away, jump down to Usage.

For our popular blog, for some reason we're storing the tags for our posts in Redis and the posts themselves in MySQL. We might have a Post class that includes some methods to manipulate tags in a Redis set:

class Post < ActiveRecord::Base
  # Return an Array of all the tags.
  def tags
    TagRedis.smembers(tag_key)
  end

  # Add a tag to the post.
  def add_tag(tag)
    TagRedis.sadd(tag_key, tag)
  end

  # Remove a tag from the post.
  def remove_tag(tag)
    TagRedis.srem(tag_key, tag)
  end

  # Return the key in Redis for the set of tags for the post.
  def tag_key
    "post:tags:#{self.id}"
  end
end

We've decided that erroring while writing to the tag data store (adding/removing) is OK. However, if the tag data store is down, we should be able to see the post with no tags. We could simply rescue the Redis::CannotConnectError around the SMEMBERS Redis call in the tags method. Let's use Toxiproxy to test that.

Since we've already installed Toxiproxy and it's running on our machine, we can skip to step 2. This is where we need to make sure Toxiproxy has a mapping for Redis tags. In config/boot.rb (before any connection is made), we add:

require 'toxiproxy'

Toxiproxy.populate([
  {
    name: "toxiproxy_test_redis_tags",
    listen: "127.0.0.1:22222",
    upstream: "127.0.0.1:6379"
  }
])

Then in config/environments/test.rb we set the TagRedis to be a Redis client that connects to Redis through Toxiproxy by adding this line:

TagRedis = Redis.new(port: 22222)
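If the proxy isn't listening on that port, every test that touches TagRedis fails with a confusing connection error. As an optional sanity check, a small helper like the one below could be run before the suite. This is only a sketch using Ruby's standard socket library, and port_open? is a hypothetical name, not part of Toxiproxy or its client.

```ruby
require "socket"

# Return true if something accepts TCP connections on host:port.
# Connection failures and timeouts are treated as "not open".
def port_open?(host, port, timeout: 1)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError
  false
end

# Example use in a test helper (assumes the listen address configured above):
#   abort "Toxiproxy is not listening on 22222" unless port_open?("127.0.0.1", 22222)
```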

All TagRedis calls in the test environment now go through Toxiproxy. That means we can add a unit test where we simulate a failure:

test "should return empty array when tag redis is down when listing tags" do
  @post.add_tag "mammals"

  # Take down all Redises in Toxiproxy
  Toxiproxy[/redis/].down do
    assert_equal [], @post.tags
  end
end

The test fails with Redis::CannotConnectError. Perfect! Toxiproxy successfully took Redis down for the duration of the block. Let's fix the tags method to be resilient:

def tags
  TagRedis.smembers(tag_key)
rescue Redis::CannotConnectError
  []
end

The tests pass! We now have a unit test that proves fetching the tags when Redis is down returns an empty array, instead of throwing an exception. For full coverage you should also write an integration test that wraps fetching the entire blog post page when Redis is down.
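The fallback logic itself can also be exercised without any network at all, which is handy for a quick check outside Rails. The sketch below uses hypothetical stand-ins throughout: FakeConnectError plays the role of Redis::CannotConnectError, DownTagStore and HealthyTagStore play the role of the Redis client, and TagReader mirrors the tags method. It complements the Toxiproxy test rather than replacing it, since it never touches a real connection.

```ruby
# Stand-in for Redis::CannotConnectError so the sketch runs without the redis gem.
class FakeConnectError < StandardError; end

# Hypothetical client that behaves like a Redis that has been taken down.
class DownTagStore
  def smembers(_key)
    raise FakeConnectError, "connection refused"
  end
end

# Hypothetical in-memory client that behaves like a healthy Redis set store.
class HealthyTagStore
  def initialize(sets)
    @sets = sets
  end

  def smembers(key)
    @sets.fetch(key, [])
  end
end

# Plain-Ruby version of the tags logic, with the client injected.
class TagReader
  def initialize(client, post_id)
    @client = client
    @post_id = post_id
  end

  # Return the tags, or an empty array when the store is unreachable.
  def tags
    @client.smembers("post:tags:#{@post_id}")
  rescue FakeConnectError
    []
  end
end

healthy = HealthyTagStore.new({ "post:tags:1" => ["mammals"] })
p TagReader.new(healthy, 1).tags          # ["mammals"]
p TagReader.new(DownTagStore.new, 1).tags # []
```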
