How to Conduct A/B Tests in Mobile Apps: Part I

AppQuantum presents an article based on the webinar series “How to Conduct A/B Tests in Mobile Apps”, created in collaboration with the mobile marketing agency Appbooster. From this series you will learn why A/B tests are conducted, how to run them most effectively, and how to avoid common mistakes. You will also get useful cases from the authors’ own practice.

A/B tests: what are they and why are they needed

 

A/B tests, or split tests, are a method of comparing two versions of a single element. They help increase a project’s conversion rates and profitability. The key metric when evaluating a test’s efficiency is revenue. Almost anything can be tested with the A/B method: a subscription screen, onboarding, an ad creative, etc.

The history of A/B testing

 

A/B tests became popular in the era of websites, when site owners were chasing higher conversion rates by changing page elements to make them more appealing to visitors.

It looked like this: you implemented the Google Optimize tag, configured the tested elements in the visual editor, and launched the experiment. Users were randomly divided into two groups: one got the old version of the element, the other the changed one.

Over time, A/B tests became more sophisticated, and with the growth of the mobile market they began to be used for testing mobile app elements as well. Yet many developers avoid A/B tests because they are considered long, expensive, and labour-intensive. Let’s look at the most common objections from developers who reject the idea of A/B testing without even trying it.

Handling objections

 

- A developer thinks he knows how to improve the app without testing

This is the most common misconception. The developer believes he can fix all the app’s flaws himself and the metrics will drastically increase. But even well-established trends have too many exceptions, so there are no universal solutions.

- The developer believes he can simply compare “before” and “after”

The developer is convinced he can make changes one by one and compare the results. But it takes time, say a week, to notice the effect of a change, and a lot can happen during that week. Metrics are influenced not only by the changes you made but also by new competitors, shifts in the UA strategy, and so on. Reality keeps changing, which is why the two variants should be compared simultaneously.

- A/B tests are time consuming and expensive

This is a fair point. In mobile games, deeply embedded app components are often the subject of testing, and a single A/B test is rarely enough to determine the optimum: you will need to run multiple tests. So stock up on money, time, and patience.

Still, it is worth it. An example from AppQuantum’s practice: one subscription-based app operated at a loss for half a year, with the loss peaking at $300 thousand. After about 50 A/B tests of its onboarding and paywall, the app became profitable, recouped the investment, and now brings a stable income of hundreds of thousands of dollars. All thanks to A/B testing.

So the only honest response to the objection that A/B tests are time consuming and expensive is: yes, they are. These costs simply have to be budgeted from the very beginning.

Formulating correct hypotheses

 

A preparatory stage for effective A/B testing is formulating a correct hypothesis. Each hypothesis is designed to affect a specific metric.

Let’s assume you are aiming to increase revenue and, to do so, you are working on retention rate. You formulate hypotheses, test them multiple times, formulate new ones, and the metric still stays at the same level. In this case the metric is considered “inelastic”.

Agree within the team on what you expect from your hypotheses: which metrics are they supposed to affect? Study your user behaviour and product features and determine the desired uplift. All this will help you formulate hypotheses that actually move the metrics, saving you time, resources, and money. In the next section we explain how to conduct the most effective tests with minimal expenses.

The secret to cheap A/B tests

 

1. Statistical significance

 

You can run cheap A/B tests if you know how to use statistical significance tools.

For instance, you are testing paywalls: take into account only the users who actually reached the paywall; the rest are irrelevant. Say you have 2,000 users divided into two groups and two variations of the tested element. Group A got 140 conversions and Group B got 160.

The difference between these two variations is small, so you cannot tell at a glance whether the improvement is real. This is where statistical significance comes in: it shows whether the difference observed on your sample is likely to hold for the wider audience or is just noise. Special calculators will do these calculations for you; you will find links to them below.

Calculators put numbers on a running test and make the interim results easier to read. They use one of two approaches: the frequentist (classical) approach or Bayesian inference. The first tells you whether the difference between options is statistically significant, i.e. whether one option can be declared the winner. The second gives the probability, in percent, that one option is better than the others.

Let’s interpret the test result using a classic frequentist calculator.
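For readers who want to check the math themselves, here is a minimal Python sketch of the kind of computation a frequentist calculator performs under the hood: a two-proportion z-test on the numbers above (1,000 users per group, 140 vs. 160 conversions). Exact calculator implementations may differ, but the result is equivalent.

```python
# A two-proportion z-test on the example above:
# 1,000 users per group, 140 vs. 160 conversions.
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error of the difference
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))                    # two-sided p-value
    return z, p_value

z, p_value = two_proportion_z_test(140, 1000, 160, 1000)
print(f"z = {z:.2f}, p-value = {p_value:.2f}")
# p-value is around 0.21, far above the usual 0.05 threshold,
# so the frequentist answer is "no significant difference detected".
```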

In this example, with 1,000 users per group, the two results do not differ significantly, so it seems impossible to draw a clear conclusion. Now let’s insert the same data into a calculator based on Bayesian inference.

This calculator gives a more actionable answer: there is an 89.5% probability that option B is better than A. With Bayesian inference we reduce the time and the number of test iterations, and therefore save money. It is important to understand that this is not about trying different calculators until we get a satisfying result: the underlying data are the same in both cases, only the interpretation differs.
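The same conclusion can be reproduced with a few lines of Python. This is a minimal sketch of the Bayesian approach, assuming uniform Beta(1, 1) priors for both groups; specific calculators may differ in details, but on the data above it gives roughly the same 89–90% probability.

```python
# Model each group's conversion rate with a Beta posterior (uniform prior assumed)
# and estimate the probability that variant B beats A by Monte Carlo sampling.
import numpy as np

rng = np.random.default_rng(42)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, samples=200_000):
    post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, samples)  # posterior of A's rate
    post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, samples)  # posterior of B's rate
    return (post_b > post_a).mean()

p = prob_b_beats_a(140, 1000, 160, 1000)
print(f"P(B is better than A) = {p:.1%}")  # roughly 89-90%, matching the figure above
```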

2. Radical A/B testing

 

Radical tests help you achieve maximum test efficiency with minimal investment. In this kind of test, the tested options are as different as possible from the control one.

When a product team starts working with tests, it tends to test the variations closest to the current element. When testing offer prices with an initial cost of $4, teams often try $3 and $5. We recommend not doing so.

We believe the optimal way to run tests is to do it radically: if the control price of the offer is $4, set values as far away as $1 and $10.

Benefits of radical A/B tests:

 

- They are revealing. Radical testing produces a strongly positive or strongly negative effect, so the impact of the change is easier to assess. Even if the test gives a negative outcome, you understand which direction to move in, whereas timid tests create the illusion that an optimum has been found. Say the $5 option lost to the $4 option: it is tempting to conclude that testing even higher prices is pointless because they will definitely lose. In our experience, it does not work that way.

- They save money. A radical test may take more iterations to converge on the optimum, but each iteration is cheaper and reaches the desired significance with fewer conversions: the larger the difference between variants, the less data we need to draw a confident conclusion (see the sample-size sketch after this list).

- Lower error probability. The closer the tested variations are to each other, the higher the chance that an observed difference is just random noise.

- A chance for a pleasant surprise. Once at AppQuantum we tested a seemingly unreasonable offer price of $25. Our entire team and the team of our developer partner were convinced it was too expensive and nobody would buy the offer at that price; competitors’ similar offers cost at most $15. Yet that variation won. Pleasant surprises happen!
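To put a number on the “fewer conversions” argument, here is a rough sample-size sketch using the standard formula for comparing two conversion rates; the 5% baseline and the uplifts are hypothetical, not taken from the article.

```python
# Sample size per group for detecting a difference between two conversion rates
# (significance level 0.05, power 0.80). Larger effects need far fewer users.
from scipy.stats import norm

def users_per_group(p1, p2, alpha=0.05, power=0.80):
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_a + z_b) ** 2 * variance / (p1 - p2) ** 2) + 1

print(users_per_group(0.05, 0.055))  # timid test, 5.0% -> 5.5%: about 31,000 users per group
print(users_per_group(0.05, 0.08))   # radical test, 5% -> 8%: about 1,100 users per group
```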

3. Degrading tests

 

The fastest and cheapest way to validate a hypothesis is a degrading test. Instead of improving an element, we make it worse or remove it from the app entirely. In the vast majority of cases this is easier and faster, because developing a genuinely good solution costs money, time, and effort and still does not guarantee a rise in the metrics.

Let’s look at an example:

 

From user feedback, the developer learns that the app has a poor tutorial. He believes that fixing all its flaws will make the metrics soar. The logic of a degrading test is simple: if improving the element is supposed to raise the metrics, then making it worse should lower them. So why not temporarily worsen the tutorial and see how sensitive the metrics really are? Once the developer has confirmed that changes to this element matter, he can start improving it and invest time, effort, and money in it.

However, when the element under test is only average or poorly made to begin with, it is better not to apply such changes at all: in that case even a degrading test can give a positive result.

When it is worth running degrading A/B tests:

 

- Narrative and quality of localisation
Exception: narrative-driven game genres and app categories.

- User interface
Offers, paywalls and stores.

- User experience design
But in this case degrading testing will affect users’ activity and their feedback on the app.

- Tutorial and onboarding
Changes to them are effective, but often not as significant as expected.

4. Testing multiple variables

 

Testing multiple variables at once is usually ineffective. There is no sense in testing ten app elements simultaneously if the outcome for most of them is obvious. But there is an exception to every rule: sometimes testing many changes together can save money.

When it is worth testing multiple changes:

 

- We know that only together these changes work effectively.
- We are sure no change will give a negative result.
- It is easier to simultaneously test several elements that are inexpensive to design.

We have just explained how to run cheap tests. Now let’s figure out how to calculate results accurately in order to reduce the time and effort spent on them.

Unit economics in A/B testing

 

When looking for the most promising places in your product funnel, we recommend using unit economics. It determines the profitability of a business model based on revenue per product or per customer. In mobile apps, revenue consists of app sales/subscriptions and ad revenue.

The scheme shows an example from a real app: four hypotheses and the calculation of their profit using unit economics. We have removed the intermediate metrics and kept the main ones: User (or Lead) Acquisition, Conversion Rate, Average Price, Lifetime Value, Cost per User, and Profit.
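The scheme itself is not reproduced here, so the sketch below uses a common simplified unit-economics model with purely hypothetical numbers. It only illustrates how hypotheses of the kind discussed next can be compared by plugging them into the same profit formula.

```python
# A minimal unit-economics sketch with hypothetical numbers. It assumes the
# common simplified model:
#   LTV per payer = average price * average number of purchases
#   profit        = UA * (C1 * LTV_per_payer - cost_per_user)

def profit(ua, c1, avg_price, purchases_per_payer, cost_per_user):
    ltv_per_payer = avg_price * purchases_per_payer
    return ua * (c1 * ltv_per_payer - cost_per_user)

baseline = profit(ua=10_000, c1=0.02, avg_price=10.0,
                  purchases_per_payer=1.2, cost_per_user=0.20)

# Hypothesis-style comparison: raise conversion from 2.0% to 2.5%
# versus raise repeat purchases from 1.2 to 1.55 per payer.
h_conversion = profit(10_000, 0.025, 10.0, 1.2, 0.20)
h_repeat     = profit(10_000, 0.02, 10.0, 1.55, 0.20)

print(baseline, h_conversion - baseline, h_repeat - baseline)
```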

Hypothesis 1

 

You are aiming to increase the customer conversion rate. Achieving an increase of 0.5% would bring $945 in profit.

Hypothesis 2

 

You intend to increase the repeat purchase rate: right now 20 customers make 24 purchases, and you need those 20 customers to make 31 purchases. When you clearly understand the aim and focus on specific metrics, you can precisely outline the range of possible hypotheses. What would we do in this case? Probably run a push campaign or give the player fewer lives. The result would be a profit of $1,028.

Hypothesis 3

 

Now we want to double the volume of acquired traffic. The profit here is $512. It turns out this would bring significantly less money than the first two hypotheses.

Hypothesis 4

 

At some point while exploring the possibilities of unit economics, a product team comes up with the idea of building something completely different from anything that has ever existed in the app: for example, a big brand-new feature that will lift several metrics at once and bring $6,737. At first glance, the only problem seems to be that developing this feature will take at least three months. In fact, there are more problems, which we will consider next.

Super features’ MVP

 

A big feature is not the most efficient way to lift metrics. The product can become too complex and hard to understand; the feature may fail to bring the desired results, and you will have wasted a lot of time, resources, and money. It may even turn out that the main problem of your product had nothing to do with this feature at all.

Suppose you still really want to build this super feature for your app, but you need some assurance that it will deliver positive results. How do you get it? Let’s figure it out.

Sequence of actions:

 

- Defining the feature’s biggest bonus and its biggest risk;
- Asking why this super feature could fail and why it could succeed;
- Estimating whether the bonus is worth the possible risks at all;
- Determining the minimum implementation needed to capture the bonus;
- Formalising how the bonus and the risk will be assessed in the test;
- Finally, comparing the variation not only with the control group, but also with alternative ones.

Super features’ MVP: Doorman Story case

 

Let’s take the Doorman Story app published by AppQuantum as an example. It is a time-management simulation mobile game in which the player develops their own hotel. The hotel employee has to serve visitors within a limited time, and each game mechanic has a timer and a sequence of actions.

We asked ourselves: what if we start selling mechanics that users can unlock for free in other games? This is an attractive monetisation point: you no longer need to produce new content, because you can sell the content you already have. To test this hypothesis, we would have to change the level balance, rework the way mechanics are obtained, and design an interface for unlocking them.

Together with our partners at Red Machine (Doorman Story’s developer), we came up with the simplest version of the super feature: a unique paid game mechanic, a chewing gum machine, that does not affect the overall game balance and appears in only one set of levels. One mechanic, one piece of art, and the app economy is not disturbed.

This example follows the super feature MVP method: we highlight the main bonus and the main risk. The bonus is the opportunity to sell what we already have without extra investment. The risk is that players may be scared away once not all tools are free and some must be bought separately; because of this, users may even leave the product.

We decided what maximum and minimum outcome we expect from this feature and how faithfully the minimal implementation reflects the idea. We answered the questions: will what we build be representative? Can we predict the effect? Will we be able to draw a conclusion from the experiment?

Moreover, we can change the hardness of the paywall at any time and see whether people keep playing. If the feature is bought only once, we cannot say exactly how many times it will be bought in the future, but behaviour at the paywall is something we can predict.

To make sure this mechanic would be in demand, we had to put users in conditions where they could not help but buy it. That is why we ran the test with a paywall from the start: it helped us figure out how valuable a purchase the gum machine is for a player, and it measures the risk even better. If users see the opportunity to buy at a low price but do not buy and leave the game, most likely they do not like the gameplay. If they leave when facing the paywall, they are probably not satisfied with the price.

Now we had decided which metrics to analyse and what we expected from them. We formulated three feature hypotheses to implement in Doorman Story and came up with MVPs of these features that could be built in two weeks at most. We tested them and got an unexpected result: the winning feature was the one nobody on the team expected to win. In fact, the most simplified mechanic became the winner.

Had we relied on intuition and not run the test, we would certainly have ended up with a less successful result. A/B testing made it possible to find the optimum as quickly as possible.

Test Risk Assessment

 

When launching a test, we must always assess the possible risks and know what can “break” in each group. For example, the user conversion rate goes up while the rate of in-app purchases goes down. That means users get through onboarding but leave the app shortly afterwards, so retention is low: although there are more users, we earn less from them.

Or, conversely, we test the number of ad impressions in the app: because there are ads on every screen, users delete the app in their second session and never return. That is why it is crucial to choose a counter-metric that we will also monitor for changes during the test.

If our concern is that something might ruin the user experience in the app, it is best to track conversion by progress and engagement, since these directly reflect what we are afraid of. This works, for example, with a paywall.
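As an illustration, here is one way such a counter-metric (guardrail) check could be wired into the evaluation of a test. The metric names and the tolerance are hypothetical, and for brevity the sketch compares point estimates and ignores statistical significance.

```python
# Accept a variant only if the primary metric improves and the counter-metric
# does not degrade beyond a tolerated margin.
from dataclasses import dataclass

@dataclass
class GroupMetrics:
    paywall_conversion: float   # primary metric
    d7_retention: float         # counter-metric (guardrail)

def accept_variant(control: GroupMetrics, variant: GroupMetrics,
                   guardrail_tolerance: float = 0.02) -> bool:
    primary_improved = variant.paywall_conversion > control.paywall_conversion
    guardrail_ok = variant.d7_retention >= control.d7_retention - guardrail_tolerance
    return primary_improved and guardrail_ok

control = GroupMetrics(paywall_conversion=0.14, d7_retention=0.20)
variant = GroupMetrics(paywall_conversion=0.16, d7_retention=0.15)
print(accept_variant(control, variant))  # False: conversion grew, but retention "broke"
```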

Test results interpretation

 

Let’s move on to interpreting test results. To evaluate them objectively, we first have to split the audience into segments.

Segmentation options:

 

1. By demographics. The audience is commonly split by country or by gender and age. This factor also determines which traffic sources you should use for the campaign;
2. By paying users. If there is enough data, we create several payer segments;
3. By new and existing users. If possible, it is worth testing only on new users;
4. By platform and traffic source.

It is important to filter out users who were not affected by the changes. If we are testing an element shown to users reaching the 7th level, we obviously need to count only those who actually reached it; otherwise we will get incorrect metrics. We also cut off everyone who did not pass through the funnel, as well as any anomalies.
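As a sketch of this filtering step, here is what it might look like with pandas on a hypothetical user table; the column names and thresholds are illustrative only.

```python
# Keep only users who were actually exposed to the tested element and drop
# funnel drop-outs and anomalies before comparing the groups.
import pandas as pd

users = pd.DataFrame({
    "user_id":         [1, 2, 3, 4, 5],
    "group":           ["A", "A", "B", "B", "B"],
    "max_level":       [3, 9, 12, 7, 8],
    "reached_paywall": [False, True, True, True, True],
    "session_count":   [2, 14, 11, 9, 480],   # 480 sessions looks like a bot
})

exposed = users[
    (users["max_level"] >= 7)            # only users who reached the tested level
    & users["reached_paywall"]           # only users who passed the funnel step
    & (users["session_count"] < 200)     # crude anomaly cut-off
]

print(exposed.groupby("group")["user_id"].count())
```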

Problems with A/B tests in mobile apps

 

Conducting tests on mobile takes much longer and is more complicated and expensive than on the web. Problems begin at the build-approval stage in the app store: to ship a change, you need to upload a whole new build, which means a lot of time is spent on review. Besides the time, there are monetary costs: you have to pay for traffic, design, and managing and processing the test.

App versions are another obstacle: not all of your users will be on the latest version, so not all of them will go through the test.

Moreover, A/B testing has a “peeking problem”: the product team looks at interim results, draws premature conclusions, and ends the test ahead of schedule.

For instance, the developer has a personal favourite among the tested variations and expects it to win. To confirm this assumption, he runs the interim numbers through a statistical significance calculator, sees that his favourite really does have a high chance of winning, and on this basis alone stops the test early to save time and money. But technically the test is not finished yet, because not all users have had time to pass through the tested element.

Such a decision is usually made too hastily. Yes, we can predict an approximate test result, but the prediction does not always hold: the conclusion could have been different had the developer let all the users allocated to the test go through it. So you may “peek” at the results, but you should always wait until the entire planned sample has passed through the test.
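A quick simulation shows why peeking is dangerous: in an A/A test, where both variants are identical, repeatedly checking significance and stopping at the first “significant” peek produces far more than the nominal 5% of false positives. The conversion rate and checkpoint sizes below are hypothetical.

```python
# A/A simulation of the peeking problem: both variants share the same true
# conversion rate (3%), yet early stopping at interim checks inflates errors.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def z_test_p_value(conv_a, n_a, conv_b, n_b):
    p = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return 2 * (1 - norm.cdf(abs((conv_b / n_b - conv_a / n_a) / se)))

def run_experiment(peeks=20, users_per_peek=500, rate=0.03):
    conv_a = conv_b = n = 0
    for _ in range(peeks):
        conv_a += rng.binomial(users_per_peek, rate)
        conv_b += rng.binomial(users_per_peek, rate)
        n += users_per_peek
        if z_test_p_value(conv_a, n, conv_b, n) < 0.05:
            return True          # stopped early on a "significant" peek
    return False

false_positives = sum(run_experiment() for _ in range(2_000)) / 2_000
print(f"False positive rate with peeking: {false_positives:.0%}")  # typically 20-30%
```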

You also need to take demographics into account: different options win in different countries. The best option in the US will not necessarily win in Russia as well.

On top of all this, one of the key problems of testing is that reality keeps changing: an option that was winning yesterday is not necessarily winning today.

Bayesian Multi-armed Bandit

 

Let’s move on to the most convenient, modern, and fastest type of A/B test: the so-called “Bayesian multi-armed bandit”. In this setup, traffic allocation is updated in real time based on how each variation performs, and the most effective variation gets the largest share of users. To put it simply, it is auto-optimisation.

As the picture shows, if option A is winning, the next day we increase its share of users; if on the third day it is winning by an even larger margin, we increase its share further. Eventually the winning option receives 100% of users.
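Under the hood, a Bayesian bandit is usually implemented with something like Thompson sampling. The sketch below is a generic illustration, not Appbooster’s actual mechanism, and the 4%/6% conversion rates are hypothetical.

```python
# Minimal Thompson-sampling sketch of a Bayesian multi-armed bandit.
# Each variant keeps a Beta posterior over its conversion rate; every user is
# routed to the variant whose sampled rate is highest, so the better variant
# gradually takes over the traffic.
import numpy as np

rng = np.random.default_rng(1)
true_rates = [0.04, 0.06]                 # unknown to the algorithm
successes = np.ones(2)                    # Beta(1, 1) priors
failures = np.ones(2)
assignments = np.zeros(2, dtype=int)

for _ in range(20_000):                   # 20,000 simulated users
    sampled = rng.beta(successes, failures)
    arm = int(np.argmax(sampled))         # route the user to the "winning" sample
    assignments[arm] += 1
    if rng.random() < true_rates[arm]:    # did the user convert?
        successes[arm] += 1
    else:
        failures[arm] += 1

print("traffic share:", assignments / assignments.sum())
# The better variant ends up with the vast majority of the traffic.
```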

The multi-armed bandit adapts to change. It saves the time you would otherwise spend segmenting users and analysing results manually, and it helps avoid the peeking problem. Most importantly, it enables automatic tests, which are otherwise expensive and difficult to implement in any product business.

Yes, A/B testing involves a lot of manual labor. But a mechanism that automatically detects good and bad options makes automatic tests possible. Appbooster is currently building such a mechanism, which will greatly simplify mobile app A/B testing.

A/B testing by Appbooster

 

New functionality has been released on the Appbooster platform. This mechanism allows you to test any part of the app.

How it works: you have a personal account where you can launch new tests. You integrate Appbooster’s SDK into your app and prepare the variations of the element under test. If it is a paywall, the developer implements several versions at once.

Next, the build is submitted to the store, and the released app contains the versions under test. When a user opens the app, the Appbooster SDK requests the list of available experiments, which may depend, for instance, on the user’s GEO.

For example, we are testing two paywalls, blue and red. The app receives the list of available experiments and our user might be shown the blue paywall. They interact with it: they either make a purchase or leave.

Appbooster collects statistics in your personal account, draws conclusions using the statistical significance calculator, and applies the change to the base version of the product. Say conversion from all traffic sources grows with paywall B: we then set it as the default and move on to testing the next element.

The new Appbooster functionality determines which experiments are available and serves users the best options. But the product team (the producer, marketer, or app owner) still has to decide for themselves whether a specific test was successful.

Getting ready for A/B testing

 

You are ready for A/B testing if you:

 

1. Have embedded analytics and tracking in the app.
2. Understand how much one user in your app costs and whether you can scale acquisition.
3. Have resources for continuous hypothesis testing.

Summing up: quick, cheap, and easy A/B tests are not a myth; you can run them if you know how. But testing must be systematic. The point is not to run one test and forget about it forever: you will get a qualitative effect only through chains of changes.

If you want to improve your mobile app but do not know where to start, get in touch with AppQuantum. We will conduct A/B tests for you, set up analytics, and guide you going forward.