The TextNow app has a number of premium features that make the user experience more engaging, allowing users to do something above and beyond the default app experience, such as: removing ads, getting a premium number, locking-in a number, getting a more comprehensive caller ID, voicemail transcriptions, international calling, and more!
We have 4+ different subscriptions, and each defines a set of features applied for a defined period of time. We keep each set of features as relevant and small as possible so users don’t have to pay for features they don’t want. We also have consumable products such as wallet credits that users can spend on international calls. (Read more about the in-app purchases (IAP) we offer for Android or iOS.)
At TextNow, we have over 87,000 active subscriptions — and growing! And as our subscriber base grows, the services that handle purchases have become a very important component of our architecture. The initial design was tightly coupled, and as we grow, it has become important to bring some flexibility to the system.
But before we dive into the why — the challenges we were facing with the initial design, how we solved them, and the results — let’s explain some basic terms first.
A capability is a single feature that allows the user to do something above and beyond the default app experience. Examples of capabilities that TextNow offers include:
- Removing ads: Perhaps the most popular capability is ad removal. While TextNow’s free service is supported by ads, we also grant the option to get rid of them.
- Locking-in your number: Avoid the risk of losing your number as a result of inactivity by locking in your number, even if you don’t use the account regularly.
- Premium number: Users can pick a premium number. Premium numbers are easy-to-remember phone numbers, usually consisting of similar repeated digits.
- Voicemail transcription: Read your voicemail messages on the go.
The capability service exposes a read-only endpoint that all clients, along with the care team, will interface with to get a consistent view of what the client is able to do. This service does not allow for changes to be made directly to the user capabilities — the bundle service owns that.
A bundle is a set of capabilities that will be applied to a user for a defined period of time. Examples of these bundles include:
- Ad Free+: Remove ads, lock-in a number, get caller ID, voicemail transcription, and more!
- Ad-Free Lite: It takes out the ads from text conversations but leaves the banner ad at the bottom.
- Premium Number: Get a premium number and lock it in.
A user can have more than one bundle active at a time; a bundle can include one or more capabilities to be granted to the user.
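The relationship between bundles and capabilities can be sketched in a few lines. This is only an illustrative model, not our production code: the identifiers in `BUNDLES`, the `ActiveBundle` type, and the `capabilities_for` function are hypothetical names chosen for the example, and the capability sets simply mirror the bundle descriptions above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical catalog: bundle id -> the capabilities it grants.
BUNDLES = {
    "ad_free_plus": {"remove_ads", "lock_number", "caller_id", "voicemail_transcription"},
    "premium_number": {"premium_number", "lock_number"},
}


@dataclass(frozen=True)
class ActiveBundle:
    bundle_id: str
    expires_at: datetime


def capabilities_for(active_bundles, now):
    """Return the union of capabilities across all bundles that have not expired."""
    caps = set()
    for bundle in active_bundles:
        if bundle.expires_at > now:
            caps |= BUNDLES[bundle.bundle_id]
    return caps
```

Because the result is a set union, two bundles that both grant the same capability (number lock, in this sketch) still yield it only once.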
The next logical question is, how will users activate these bundles? Well, there are a few ways through which users can purchase the bundles:
- IAP (In-app purchases): Purchase a subscription from inside the Android or iOS app and access a set of premium features, with a defined cost, and a period of time. Subscriptions map to bundles; when a subscription is purchased by the user, the bundle will become active.
IAP also includes consumable products such as wallet credits users can spend on international calls, but these do not map to a bundle. They do not expire; they get consumed, and the credits go directly into the user’s wallet.
- Rewarded Videos: Users can watch a rewarded video (a short ad) to earn credits that go into their wallet, and use the credits to redeem the Ad Free+ bundle for a period of time.
We have chosen to implement these two concepts in a pair of microservices. The IAP service talks with the bundle service to grant or revoke (when expired) subscriptions, and with the wallet service to deposit credits upon purchase. The bundle service, upon redeeming a reward, talks to the wallet service to withdraw credits and then activates the redeemable bundle (see conceptual model diagram above).
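As a rough illustration of the reward-redemption flow, the sketch below withdraws credits first and activates the bundle only if the withdrawal succeeds. Everything here is an assumption made for the example: the in-memory `Wallet` and `BundleService` classes, their method names, and the credit cost stand in for what are really separate services talking over the network.

```python
from datetime import datetime, timedelta


class InsufficientCredits(Exception):
    pass


class Wallet:
    """In-memory stand-in for the wallet service."""

    def __init__(self):
        self._balances = {}

    def deposit(self, user_id, credits):
        self._balances[user_id] = self.balance(user_id) + credits

    def withdraw(self, user_id, credits):
        if self.balance(user_id) < credits:
            raise InsufficientCredits(user_id)
        self._balances[user_id] -= credits

    def balance(self, user_id):
        return self._balances.get(user_id, 0)


class BundleService:
    """In-memory stand-in for the bundle service."""

    def __init__(self, wallet):
        self._wallet = wallet
        self._expiry = {}  # (user_id, bundle_id) -> expiry datetime

    def activate(self, user_id, bundle_id, duration, now):
        self._expiry[(user_id, bundle_id)] = now + duration

    def redeem(self, user_id, bundle_id, cost, duration, now):
        # Withdraw first: if the user is short on credits, this raises
        # and the bundle is never activated.
        self._wallet.withdraw(user_id, cost)
        self.activate(user_id, bundle_id, duration, now)

    def is_active(self, user_id, bundle_id, now):
        expiry = self._expiry.get((user_id, bundle_id))
        return expiry is not None and expiry > now
```

In a real deployment the two steps cross a service boundary, so a failure between the withdrawal and the activation would also need handling (for example, by refunding the credits); the sketch leaves that out.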
The motivation, the challenges, and the reasons behind revamping our core monetization services were as follows:
- The initial design was tightly coupled to one bundle: We had one column in the “users” table that determined whether the user had the Ad Free+ bundle enabled or not, and for how long. Not surprisingly, this lacked the flexibility to support more than one bundle, changes in capabilities, etc.
- Clients were managing the purchase life cycle (purchased, renewed, canceled, expired, etc.). This led to cross-platform duplication, inconsistent behavior, and 3x engineering effort (some business logic needs to be implemented on the server side anyway!). Moreover, it was not uncommon to see subscriptions processed incorrectly due to temporary failures and inconsistent logic across the implementations.
- We didn’t have a monitoring system to detect anomalies, fraud, or successful and failed events. A common example is malicious users repeatedly refunding purchases (e.g. wallet credits) after using them. In addition, because the data was not being analyzed, we had no insight into what makes users cancel, renew, refund, or purchase in the first place.
- Because they query third-party APIs, these services are prone to temporary errors that leave subscriptions unverified, and hence the user isn’t granted the associated bundle. Handling failures with retries improves user experience, increases retention, and reduces support costs.
- Tracking the history of purchases and user activity gives us more visibility and allows us to provide better customer support. Users also benefit from knowing which bundles are active, when they were activated, and when they expire.
- As we grow, product and marketing teams want to have the flexibility to define and adjust bundles, their pricing, and durations, without requiring engineering support.
How we went about solving our issues, the roadblocks we faced along the way, and what the outcome was, is what we’re going to uncover next.
First, we wanted to give clients clear APIs with a set of separate routes that were designed so that each route had a single, specific action.
For example, in the IAP service, we split the routes into two sets of routes for Android and iOS clients. And for each client, we defined a route for purchasing a subscription, another one for consumable products (wallet credits), and so on.
Why? Well, first, each route has a separate flow: handling renewable subscriptions is not the same as handling consumable products, for both the BE (backend) and the clients. We are now able to track, monitor, secure, debug, test, and release changes to one flow without impacting the others. As a side note, the BE now handles the purchase life cycle; clients no longer have to deal with that. In addition, separate routes let us record the user activity history that the care team uses internally for support.
By “vendor” we mean the App Store and the Play Store. They handle the purchase payment, give us an API to verify purchases given their IDs, and send us real-time notifications whenever there is a change in a user’s purchase state (e.g. subscription canceled).
Along the way, we ran into a few painful situations that we didn’t expect to have to deal with, including:
- Ambiguous, uninformative documentation around how these APIs work. We spent a lot of time and effort figuring out how their APIs respond in different situations and environments (prod vs sandbox) and reconciling that with our expectations.
- Some API updates, including pending deprecations, were not proactively communicated by the API vendor, so we had to review their API documentation frequently to catch them.
- These third-party APIs go down and time out, which means we cannot verify purchases. As a result, users don’t get the purchased features right away, causing escalations from care and a bad customer experience.
- The inconsistency between the App Store and the Play Store forced us to write additional code, classic workarounds, cron jobs, and extra tests. For example, one vendor might not send us a notification when a subscription gets renewed or refunded, which puts the burden on us to poll their API quite frequently.
Access patterns and Datastores
TextNow uses a mix of relational and NoSQL databases depending on the access patterns, and the flexibility around querying the data.
In the IAP service, for instance, we split read operations from write operations, a typical CQRS pattern: essentially, we use a relational database for writes and NoSQL for reads.
We created two relational tables, one for the App Store and one for the Play Store. Each holds all purchases (active and expired, subscriptions and consumable products), and each row represents the most recent state of a user purchase. These two tables are mainly used to manage and track purchases. They are therefore optimized for updates (e.g. updating purchase status) and for querying by purchase ID, as well as for the cron jobs that scan the indexed table to periodically sync our data with the App Store or Play Store.
The access patterns were not only unclear initially, they might also change as the vendor’s API changes. For example, if we ever wanted to scan the table to get a list of all recently expired purchases (say, because the vendor API stopped notifying us about auto-renewed purchases), in a relational table it is a matter of adding an index. The need to be flexible and accommodate such changes led us to choose a relational model.
Finally, most relational DB engines handle concurrent insertions and updates safely (for example, through transactions and unique constraints). This lets us detect duplicate requests and avoid granting the same feature twice. As far as scaling is concerned, we found that they can scale to 10x our current capacity.
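One common way to lean on the database for duplicate detection is a uniqueness constraint on the vendor's purchase ID. The sketch below shows the idea with SQLite purely because it ships with Python; the engine, the table name `play_store_purchases`, and its columns are assumptions made for the example, not our actual schema.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a hypothetical purchases table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE play_store_purchases (
               purchase_id TEXT PRIMARY KEY,  -- uniqueness enforced by the engine
               user_id     TEXT NOT NULL,
               status      TEXT NOT NULL
           )"""
    )
    return conn


def record_purchase(conn, purchase_id, user_id):
    """Insert a purchase; return False if it was already recorded."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO play_store_purchases (purchase_id, user_id, status) "
                "VALUES (?, ?, 'active')",
                (purchase_id, user_id),
            )
        return True
    except sqlite3.IntegrityError:
        # Primary-key violation: this purchase was submitted before.
        return False
```

The caller would grant the bundle only when `record_purchase` returns `True`, so a request that arrives twice cannot grant the same feature twice.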
On the other hand, a DynamoDB (NoSQL) table is designed and optimized for querying the life cycle of purchases. Every single change, such as purchasing, renewing, canceling, or refunding a purchase, gets stored as an immutable event, sorted by date. The access pattern here is clear and straightforward: get a list of historical changes by user id and, optionally, within a given period of time.
For the capability, bundle, and wallet services, we decided to use DynamoDB with optimistic locking based on a version number, a strategy that protects each write from being silently overwritten by another concurrent write. The access patterns here are very simple lookups by user id: get or update the user’s bundles given the user id.
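To make the optimistic locking idea concrete, here is a minimal in-memory sketch. It does not call DynamoDB; `InMemoryTable`, `put_if_version`, and `add_bundle` are hypothetical names, and the version check plays the role that a conditional write on a version attribute plays in the real database.

```python
class VersionConflict(Exception):
    pass


class InMemoryTable:
    """Stand-in for a table whose writes are conditional on a version number."""

    def __init__(self):
        self._items = {}

    def get(self, key):
        item = self._items.get(key)
        return dict(item) if item else None

    def put_if_version(self, key, item, expected_version):
        current = self._items.get(key)
        current_version = current["version"] if current else 0
        if current_version != expected_version:
            # Someone else wrote since we read: refuse to overwrite.
            raise VersionConflict(key)
        self._items[key] = {**item, "version": expected_version + 1}


def add_bundle(table, user_id, bundle_id, max_attempts=3):
    """Read-modify-write with retry when a concurrent writer got there first."""
    for _ in range(max_attempts):
        item = table.get(user_id) or {"bundles": [], "version": 0}
        updated = {"bundles": sorted(set(item["bundles"]) | {bundle_id})}
        try:
            table.put_if_version(user_id, updated, item["version"])
            return
        except VersionConflict:
            continue  # re-read the latest state and try again
    raise VersionConflict(user_id)
```

A writer holding a stale copy fails instead of overwriting newer data, and the retry loop simply re-reads and reapplies the change.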
The same DynamoDB table can also be used to store historical data (a list of user activities). In this case, the PK (partition key) is always the user id but the SK (sort key) can be different to support different access patterns:
- Get/Update user bundles given the user id → SK is the bundle id.
- Get a list of user activities given the user id → SK is the date.
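The two access patterns above can be simulated with a small in-memory table keyed by partition key and sort key. This is a sketch under assumptions, not our actual table layout: the `SingleTable` class is hypothetical, and the `BUNDLE#` and `ACTIVITY#` sort-key prefixes are a common single-table convention that we adopt here only to keep the two item types apart.

```python
from collections import defaultdict


class SingleTable:
    """In-memory stand-in for a table with a partition key and a sort key."""

    def __init__(self):
        self._partitions = defaultdict(dict)  # pk -> {sk: item}

    def put(self, pk, sk, item):
        self._partitions[pk][sk] = item

    def query(self, pk, sk_from, sk_to):
        """Items in one partition with sk_from <= sk <= sk_to, ordered by sk."""
        rows = self._partitions.get(pk, {})
        return [rows[sk] for sk in sorted(rows) if sk_from <= sk <= sk_to]


def put_bundle(table, user_id, bundle_id, expires_at):
    table.put(user_id, f"BUNDLE#{bundle_id}",
              {"bundle_id": bundle_id, "expires_at": expires_at})


def put_activity(table, user_id, timestamp, event):
    # ISO 8601 timestamps sort lexicographically in chronological order.
    table.put(user_id, f"ACTIVITY#{timestamp}",
              {"timestamp": timestamp, "event": event})


def user_bundles(table, user_id):
    return table.query(user_id, "BUNDLE#", "BUNDLE#\uffff")


def user_activities(table, user_id, start, end):
    return table.query(user_id, f"ACTIVITY#{start}", f"ACTIVITY#{end}")
```

Both lookups touch a single partition (the user id), which is what keeps them cheap; only the sort-key range changes between the two patterns.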
For those interested in learning more, Rick Houlihan has a great talk about advanced DynamoDB patterns and data models.
External third-party APIs fail; that’s not uncommon. All of our requests to third-party APIs are wrapped in retries with exponential backoff. If a request still fails after all retries, we insert the request JSON as is into a dead-letter queue backed by a relational table. A cron job runs periodically, scans this table, and replays the failed requests against the API.
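The retry-then-park flow might look roughly like the following. It is a sketch under stated assumptions: SQLite stands in for the relational table, `TemporaryError`, `call_with_backoff`, and `replay_dead_letters` are hypothetical names, and the attempt count, base delay, and jitter are illustrative values rather than our production settings.

```python
import json
import random
import sqlite3
import time


class TemporaryError(Exception):
    """Raised by the sender for failures that are worth retrying."""


def open_dead_letters():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dead_letters (id INTEGER PRIMARY KEY, payload TEXT NOT NULL)"
    )
    return conn


def call_with_backoff(send, request, conn, max_attempts=4, base_delay=0.5,
                      sleep=time.sleep):
    """Try send(request); on repeated failure, park the request as JSON."""
    for attempt in range(max_attempts):
        try:
            return send(request)
        except TemporaryError:
            if attempt < max_attempts - 1:
                delay = base_delay * 2 ** attempt  # 0.5s, 1s, 2s, ...
                sleep(delay + random.uniform(0, delay / 2))  # add jitter
    with conn:
        conn.execute("INSERT INTO dead_letters (payload) VALUES (?)",
                     (json.dumps(request),))
    return None


def replay_dead_letters(conn, send):
    """What the periodic job does: resend parked requests, drop the successes."""
    rows = conn.execute("SELECT id, payload FROM dead_letters").fetchall()
    for row_id, payload in rows:
        try:
            send(json.loads(payload))
        except TemporaryError:
            continue  # leave it in the table for the next run
        with conn:
            conn.execute("DELETE FROM dead_letters WHERE id = ?", (row_id,))
```

Passing `sleep` as a parameter is only a convenience for testing; in normal use the default `time.sleep` applies.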
For requests coming from clients, clients will also retry sending the failed request upon receiving a temporary failure error (i.e. internal failures).
Streaming and Analyzing data
We rely heavily on DynamoDB as a fast, scalable, key-value NoSQL database.
As mentioned earlier, all historical data gets stored in DynamoDB for tracking and debugging purposes. But what about shipping these real-time user activities and storing them for analysis? We use DynamoDB Streams and AWS Lambda triggers to stream DB changes to Lambda. From there, we send the changes to our Kafka cluster, and they eventually get stored in Redshift, our data warehouse.
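A Lambda function on the receiving end of a DynamoDB stream could be shaped like the sketch below. Several things here are assumptions: the stream is taken to include new item images, only `INSERT` records are forwarded (history items are immutable), `plain` handles just a few attribute types, and `publish` is a hypothetical callable standing in for a real Kafka producer.

```python
import json
from decimal import Decimal


def plain(attr):
    """Convert one DynamoDB-typed attribute, e.g. {"S": "u1"}, to a Python value.

    Only the string, number, and boolean types are handled in this sketch.
    """
    (type_tag, value), = attr.items()
    if type_tag == "S":
        return value
    if type_tag == "N":
        return Decimal(value)
    if type_tag == "BOOL":
        return value
    raise ValueError(f"unsupported attribute type: {type_tag}")


def to_messages(event):
    """Turn the stream records in a Lambda event into JSON strings."""
    messages = []
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue  # assumption: only newly written history items matter
        image = record["dynamodb"]["NewImage"]
        item = {name: plain(attr) for name, attr in image.items()}
        messages.append(json.dumps(item, default=str, sort_keys=True))
    return messages


def make_handler(publish):
    """Build a Lambda-style handler around a publish(message) callable."""
    def handler(event, context):
        messages = to_messages(event)
        for message in messages:
            publish(message)
        return {"published": len(messages)}
    return handler
```

Keeping the record-to-message conversion in a pure function (`to_messages`) makes it easy to test without any AWS or Kafka dependency.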
Real-time monitoring is essential to understand user behavior, detect anomalies, fraud, successful/failure events, and take immediate actions if needed.
We ship metrics and stats from the BE services to DataDog: success rate, error rate, and latency, plus anomaly monitors around the number of purchases, cancellations, renewals, and refunds, data inconsistencies, and duplicate purchase submissions, as well as alerts when the vendor API goes down.
Furthermore, the data we ship to our analytics pipeline helps us understand how business decisions (e.g. adding a new feature) affect user behavior over time. It also helps us aggregate user data, correlate it with other relevant data, spot malicious activity, and answer questions such as why some users cancel after purchasing.
We have experienced huge improvements in:
We significantly improved engineering efficiency as we reduced duplicate effort across clients, and allowed client engineers to focus on just building views rather than taking care of the business logic and dealing with vendor APIs.
Changes are done only once in the BE and reflect across all clients.
Since the business logic is done on the BE, this eliminates the possibility of inconsistent behavior across clients.
And since the API responses encourage consistent design across all client apps, users get a consistent UI across all platforms. To illustrate, all clients get the same list of bundles and capabilities associated with a user and display it as is; they do not need to apply any business logic (e.g. verifying or calculating subscription expiry).
It is now easier and faster to add, adjust, and remove bundles and capabilities without requiring any additional work.
For clients, they can support all future changes given that the response payload is the same.
For BE, allowing the product team to define and create bundles and their set of capabilities through an internal API requires almost zero engineering effort.
Monitoring, analytics, and handling failures
Using our monitoring and analytics methods we are able to get more visibility into the user activities, help the product team to make informed decisions, and detect & handle suspicious activities and failures as early as possible.
Currently, clients need to poll our BE API to get the most recent list of the user’s bundles. So, for instance, when a bundle gets revoked (becomes inactive), clients don’t know about it until they query the API. A natural improvement is to move from the “poll” model to a “push” model by sending real-time notifications from the BE to the clients using WebSockets.
Some mitigations require human intervention. For example, even though we get notified upon detecting malicious activity, the team still needs to decide what to do about it. It would be much easier if we were able to take real-time actions based on the historical data to mitigate the impact and reduce the need for human intervention.
Big thanks to the team of developers and product who consistently work hard on improving the user experience at TextNow. It has been a pleasure being part of this journey and writing this article but the real credit goes to Robert Robinson, Jason Spangenthal, Severn Tsui, and Scott Henderson.