Pro-style testing

Mike Solomon

If you write software professionally, you probably write automated tests. This is fantastic.

But have you ever thought about how to leverage the experiences of other software engineers to:

  1. Write tests that are maximally likely to prevent bugs
  2. Write tests that make locating and fixing the cause of a bug easy
  3. Write as few (and as short and readable) tests as possible while achieving the above

Below are general guidelines to build a mental framework of what, how, and why to test in any language. There are also specific and hard-earned recommendations for and against a variety of possible testing strategies. Most recommendations come with links to more reading.

If you want to test like a pro, read on. If you disagree with a recommendation or would like elaboration, leave a comment or send me an email. There is always room to improve.

Terminology

Language around automated testing is often ambiguous and overloaded. I will use these terms:

Test sizes

Terms like “unit test” and “integration test” can mean different things to different people, so we will use test sizes as defined by Google, recapped here:

  • Small: Usually called unit tests, Small tests are each extremely narrow in scope, run quickly, and test behavior in isolation.
  • Medium: Sometimes called integration tests, Medium tests check interactions between layers and components.
  • Large: Also called end-to-end or system tests, Large tests are very coarse-grained and often touch many components and make use of the network.

Properties of tests

  • Fidelity: A high-fidelity test is sensitive to defects in the code under test: the test fails when the code is broken.
  • Resilience: A resilient test fails only when the code under test is broken; refactoring won’t break it, and it is not flaky.
  • Precision: A high-precision test tells you where the defect is, ideally down to the exact line and what differed from our expectations.

General principles to follow

Test one behavior per test

Each test should test one behavior. Many of your methods will have one behavior, so verify that behavior and as little else as possible (often nothing!). Asserting runtime invariants is okay, but usually there should be few assertions other than the primary expected behavior.
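
For illustration, here is a minimal sketch in ScalaTest style with a hypothetical Discounter class; each test verifies exactly one behavior and asserts nothing else:

  import org.scalatest.funsuite.AnyFunSuite

  // A hypothetical class under test.
  class Discounter(threshold: Int) {
    def discount(totalCents: Int): Int =
      if (totalCents >= threshold) totalCents / 10 else 0
  }

  class DiscounterTest extends AnyFunSuite {
    // Each test checks exactly one behavior, rather than piling several
    // assertions about different behaviors into a single test.
    test("orders at or above the threshold get a 10% discount") {
      assert(new Discounter(threshold = 1000).discount(2000) == 200)
    }

    test("orders below the threshold get no discount") {
      assert(new Discounter(threshold = 1000).discount(999) == 0)
    }
  }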

Test each behavior once

Testing the same thing more than once is a maintenance burden. Obviously you should not test the same behavior with two separate tests, but sometimes it is tempting to “cross-test” by adding an extra assertion in a related test. Avoid this, because it decreases the Precision when the test fails, and because it violates “Test one behavior per test.”

Write tests that provide value by reducing risk

A test should reduce risk, or it is not providing any value.

One way to check this is to ask “what class of bug could this test detect?” If there is no answer, there should be no test. You can rephrase as “what risk does this test help us avoid?”, and if there is no answer, you need no test.

It works the other way too: think of what the risks (possible classes of bugs) are, and write the appropriate tests to detect them.

One case that obviously provides value is a regression test: you’ve encountered a bug before, so it’s important to have a test to prevent it from reappearing in the future.

Name tests to describe the behavior precisely

Test names appear in test failures and in the code itself. If the names precisely describe the behavior being tested, readers do not need to read the test to understand what cases are covered and which aren’t, and failures become easier to debug and fix.

Tests are often a good way to learn how an interface works, and clear test names can be useful to demonstrate an interface.
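
As a sketch of the difference, assume a hypothetical UserRepository; a vague name like “testGetUser” says nothing when it fails, while these names state the behavior and the case each test covers:

  import org.scalatest.funsuite.AnyFunSuite

  // A hypothetical class used only to illustrate naming.
  class UserRepository(users: Map[Int, String]) {
    def getUser(id: Int): Option[String] = users.get(id)
  }

  class UserRepositoryTest extends AnyFunSuite {
    test("getUser returns the stored user for a known id") {
      val repo = new UserRepository(Map(1 -> "ada"))
      assert(repo.getUser(1) == Some("ada"))
    }

    test("getUser returns None when the id is unknown") {
      val repo = new UserRepository(Map(1 -> "ada"))
      assert(repo.getUser(2).isEmpty)
    }
  }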

Rework code until it is easy to test

You must test your code, so your code must be easy to test. If you write your code before your tests without keeping this in mind, you may not notice that it is hard to test until you begin writing the tests. When this happens, consider reworking your code to be more testable.

If your test is long, your code may need to change to improve testability.

Strategies covered elsewhere in this guide, such as leveraging the type system and reducing the number of dependencies, can make your code more testable.

Watch it fail

It’s tempting to write a test, see that it passes, and move on. But what if you made a mistake in your test? You probably don’t have tests for your test, so instead, break your code in a way the test should detect, run the test, and watch it fail.

This avoids two classes of bugs:

  1. Your test won’t detect the bug you thought it would (low Fidelity)
  2. Your tests aren’t actually being run (it happens)

Tests should use literals where possible

In production code, deduplication and flexibility are very important. Surprisingly, in tests, it is often better to duplicate and inline simple values and literals to reduce the likelihood of mistakes and to improve the direct readability of the test. Simple immutable objects shared across tests are also acceptable.

For example, URL strings should be literal values in tests instead of being constructed as URL objects. This sort of duplication is more readable, simpler, and less error prone. In exchange, it is very inflexible, but that is the better tradeoff in a test.
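
A sketch of the idea, assuming a hypothetical buildProfileUrl helper; the first assertion spells out the expectation as a literal, while the second constructs it and is harder to read and easier to get wrong:

  // A hypothetical function under test.
  def buildProfileUrl(userId: Long): String = s"https://example.com/users/$userId"

  // Preferred: the expected value is a literal anyone can read at a glance.
  assert(buildProfileUrl(userId = 42) == "https://example.com/users/42")

  // Avoid: constructing the expected value hides what is actually expected
  // and can repeat the same mistake as the code under test.
  assert(buildProfileUrl(userId = 42) == new java.net.URL("https", "example.com", "/users/42").toString)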

See Don’t put logic in tests

This does not apply as much to property checks, which should use generators where possible.

Leverage the type system

Careful use of a statically type-checked language renders entire classes of Small tests unnecessary, because the type checker can enforce certain guarantees. The static typing features available vary by language; make use of those that are available, and consider this when choosing a new language.

Carefully choose the types of primitives so they enforce as many guarantees as possible. For instance, prefer an unsigned integer over a signed integer when a value cannot be negative; this eliminates the need for one test. The same principle applies to objects and other derived types: consider introducing new types that can only be constructed with the guarantees that will later be relied upon, which removes the need to check those guarantees in the code relying on them. Consider refining interfaces to accept only values maximally verified by the type system.
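
A minimal sketch of the “new type” idea: a hypothetical Quantity that can only be constructed non-negative, so code that accepts it (and its tests) never needs to cover the negative case:

  // A value that can only be constructed non-negative.
  final class Quantity private (val value: Int)

  object Quantity {
    def fromInt(i: Int): Option[Quantity] =
      if (i >= 0) Some(new Quantity(i)) else None
  }

  // No "what if quantity is negative?" branch is needed here,
  // and therefore no test for that branch.
  def totalPrice(unitPriceCents: Int, quantity: Quantity): Int =
    unitPriceCents * quantity.value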

Mistakes to avoid

Don’t write change-detector tests

One way to test code is to duplicate some of the logic you are trying to test in the test itself, then assert that the results are equal.

This only detects when your code changes, and cannot catch any bugs apart from “the code changed.” Such a test has low Fidelity and low Resilience. Such a test is a maintenance burden. Rewrite or delete.
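
As a sketch with a hypothetical fullName function: the first assertion restates the implementation and can only fail when the code changes, while the second states the expected behavior directly as a literal:

  // A hypothetical function under test.
  def fullName(first: String, last: String): String = s"$first $last"

  val first = "Ada"
  val last = "Lovelace"

  // Change-detector: duplicates the implementation, so it cannot catch a
  // bug that exists in both the code and the test.
  assert(fullName(first, last) == s"$first $last")

  // Behavioral: states the expected output directly.
  assert(fullName("Ada", "Lovelace") == "Ada Lovelace")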

One common form of change-detection is a test that checks each step of the implementation. Test behavior instead.

One very specific type of test looks like (but is not) a useless change-detector: a test that compares the output of the code under test against a reference reimplementation. It provides safety during refactoring or optimization (and little value before then) by verifying that the underlying behavior has not changed. This type of test is easy to misapply and is insufficient on its own; prefer other types of tests when possible, perhaps simple property checks.

Don’t test code you don’t own in Small or Medium tests

Tests should live in the same project as the code that they test, and should be maintained by the same people. This gives the owners freedom to refactor and make bug fixes as needed, provided their tests still pass. This lets the people best suited to test and maintain test code do so. This reduces your own maintenance burden. Note that it makes sense to test Adapter or other code that wraps a dependency.

Most dependencies will be services or libraries. If you do not trust a dependency, consider contributing new tests to cover the cases they do not. If you still don’t trust a dependency, consider removing or replacing it. If you cannot contribute to a dependency directly, consider maintaining a patch, or if necessary, consider a fork. If you have a binary or service dependency that you cannot contribute to, eliminate, or trust, consider writing a separate suite of tests to ensure it works as you expect. In no case should you test an external dependency as a side effect of testing your own code in a Small or Medium test.

Large tests may implicitly test external dependencies; this is to be expected. Even so, they should not explicitly test external dependencies beyond, say, setting up connections.

Red flags and code smells

  • Long tests. Tests should generally be short and easy to follow. Arrange, act, assert (see AAA below)
  • Sleeping (Thread.sleep, Future.sleep, sleep(), etc.). There are very few places this is actually what you want.
  • Many mocks (specifically mocking, not other test doubles). You may be testing the implementation too closely. The code under test may have too many dependencies, and it may have more than one concern.
  • The test generates nontrivial data. There may be bugs in the data generation code. Consider separating it out and testing it. Consider using a property check, which can help make this reusable. Consider breaking the code under test into multiple methods which can be tested on simpler data.
  • Tests with logic that also appears in the code under test. Is this a change-detector test?

Concrete tips

Hat-tip to Ryan Greenberg, from whom I stole most of this section.

AAA test structure

Many tests are easy to read if they are in the form: Arrange, Act, Assert. First Arrange the required objects, perform the Act you want to test, then Assert the results are as expected.
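
A minimal ScalaTest-style sketch of the structure, using a hypothetical ShoppingCart:

  import org.scalatest.funsuite.AnyFunSuite

  // A hypothetical class used only to illustrate the structure.
  class ShoppingCart {
    private var items = List.empty[String]
    def add(item: String): Unit = items = item :: items
    def size: Int = items.size
  }

  class ShoppingCartTest extends AnyFunSuite {
    test("adding an item to an empty cart gives a cart of size 1") {
      // Arrange
      val cart = new ShoppingCart

      // Act
      cart.add("apple")

      // Assert
      assert(cart.size == 1)
    }
  }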

Write the assertion first

Think of test cases in terms of properties that must be true, then assert them. It may be easier to think of the assertion first, then write code to arrange objects and act on them.

Write exactly one test for each equivalence class

For example, if the code is intended to work the same on any number of items in a sequence, you don’t need a test for 2 items, 3 items, and 4 items.

When testing state changes, assert before as well as after

For example, if a method should increment a counter, assert that the counter value starts at what you expect before calling the method. This avoids certain bugs in tests.
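
A sketch with a hypothetical Counter; the first assertion pins the starting state, so the final assertion actually demonstrates that increment() changed it:

  // A hypothetical class under test.
  class Counter {
    private var count = 0
    def increment(): Unit = count += 1
    def value: Int = count
  }

  val counter = new Counter

  // Assert the starting state first; if the fixture ever changes,
  // the failure points at the setup rather than at increment().
  assert(counter.value == 0)

  counter.increment()

  assert(counter.value == 1)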

Only control direct dependencies, not dependencies of dependencies

Only set up and rely on direct dependencies of what you are testing (possibly using a test double such as a stub, mock or fake), never dependencies of dependencies.

For example, imagine:

  • We have a request handler logValidRequests that validates a request req by calling validate(req) and then logs req
  • logValidRequests won’t log req when validate returns false
  • One way req can be invalid is if it is all lowercase

You should not write your test by calling logValidRequests with an all-lowercase req. Instead, stub validate to return false, then assert that nothing is logged (and don’t forget other test cases!). This improves Resilience and Precision.
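
A sketch of this example, assuming the handler receives its validator and logger as constructor arguments so the test can substitute plain function stubs (no mocking library required):

  import scala.collection.mutable

  // A hypothetical handler that takes its direct dependencies as functions.
  class RequestHandler(validate: String => Boolean, log: String => Unit) {
    def logValidRequests(req: String): Unit =
      if (validate(req)) log(req)
  }

  val logged = mutable.Buffer.empty[String]

  // Stub the direct dependency: force validate to report "invalid" instead
  // of crafting a request that happens to be invalid.
  val handler = new RequestHandler(validate = _ => false, log = req => logged += req)

  handler.logValidRequests("some request")

  // Nothing should have been logged.
  assert(logged.isEmpty)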

Assert on boundaries for functions accepting a contiguous range of inputs

def isBig(num: Long) = num > 100

You should test 101, but also 100, because the behavior changes at that boundary. Even better, write a property test. Remember to test each equivalence class exactly once.
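
As a sketch, two assertions pin both sides of the boundary of the isBig example above:

  // The behavior changes between 100 and 101, so pin both sides of the boundary.
  assert(!isBig(100))
  assert(isBig(101))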

More about testing

Property-based tests

Property-based tests (also called property checks) are a different way to think about testing. The basic idea is to assert that some law holds about the code under test, and then let the test framework generate test cases in an attempt to disprove the law. When it does so, it will try to find a minimal failing case to help you find your bug.

It is worth writing property checks if you can, despite the initial learning curve. They allow you to declare laws and let the computer worry about coming up with cases that are likely to fail. They encourage writing reusable Generators that improve readability and reuse.

Property-based tests are most useful in unit tests.

In Scala there is ScalaCheck; similar libraries exist for most other languages.
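
A minimal sketch in ScalaCheck’s Properties style, using the classic string-concatenation law as the property; the framework generates the inputs and shrinks any failing case:

  import org.scalacheck.Prop.forAll
  import org.scalacheck.Properties

  // A law stated over all inputs; ScalaCheck tries to refute it.
  object StringLaws extends Properties("String") {
    property("concatenation preserves total length") = forAll { (a: String, b: String) =>
      (a + b).length == a.length + b.length
    }
  }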

Large tests

Large tests are your last line of defense before production (or “reported by users”). Not all tests are equally useful at this level.

Test “happy path” behavior. This makes sure that the system works end-to-end in the real environment. Depending on your setup, you may be able to run this in a staging environment as well as the production environment.

Test for regressions in known high-level bugs. If you can write a Small or Medium test for this, prefer that instead. However, make sure each regression gets a test, and sometimes this means a Large test.

Don’t attempt to test every way in which your system can fail. For example, if you have a suite of validations that are already tested in Small tests, do not repeat every test at the Large (or Medium) level. Instead, test one representative validation to ensure that the validations are wired in. Even better, test at the Medium level.

Refactoring tests

It can be hard to refactor your tests, because unlike your production code, you don’t have tests (for your tests).

One good strategy is to refactor your test code after manually (and temporarily) breaking the production code. This gives you some confidence that your tests fail when they ought to fail (showing their level of Fidelity).

Reading and resources

More terminology

Box colors

  • Black box: Knows nothing of internals–testing the interface’s contract, not implementation
  • White box: Testing the internals–testing the implementation, not the interface
  • Grey box: Testing interface’s contract as in black box, but sets up state beforehand with knowledge of internals

Subtypes of tests

  • Regression: did a bug we fixed reappear?
  • Performance: how fast is the code? is it fast enough?
  • Security/Privacy: will this leak data or allow unwanted access?
  • Code quality: does the code meet standards we can automatically (statically) measure?
  • Acceptance: Does it conform to specifications?
  • Stress: How does it handle being put under increasing loads, up to failure?
