Making Camera Uploads for Android Faster and More Reliable

Camera uploads is a feature in our Android and iOS apps that automatically backs up a user's photos and videos from their mobile device to Dropbox. The feature was first introduced in 2012, and uploads millions of photos and videos for hundreds of thousands of users every day. People who use camera uploads are some of our most dedicated and engaged users. They care deeply about their photo libraries, and expect their backups to be quick and dependable every time. It's important that we offer a service they can trust.

Until recently, camera uploads was built on a C++ library shared between the Android and iOS Dropbox apps. This library served us well for a long time, uploading billions of images over many years. However, it had numerous problems. The shared code had grown polluted with complex platform-specific hacks that made it difficult to understand and risky to change. This risk was compounded by a lack of tooling support, and a shortage of in-house C++ expertise. Plus, after more than five years in production, the C++ implementation was beginning to show its age. It was unaware of platform-specific restrictions on background processes, had bugs that could delay uploads for long periods of time, and made outage recovery difficult and time-consuming.

In 2019, we decided that rewriting the feature was the best way to offer a reliable, trustworthy user experience for years to come. This time, the Android and iOS implementations would be separate and use platform-native languages (Kotlin and Swift, respectively) and libraries (such as WorkManager and Room for Android). The implementations could then be optimized for each platform and evolve independently, without being constrained by design decisions from the other.

This post is about some of the design, validation, and release decisions we made while building the new camera uploads feature for Android, which we released to all users during the summer of 2021. The project shipped successfully, with no outages or major issues; error rates went down, and upload performance greatly improved. If you haven't already enabled camera uploads, you should try it out for yourself.

Designing for background reliability

The main value proposition of camera uploads is that it works silently in the background. For users who don't open the app for weeks or even months at a time, new photos should still upload promptly.

How does this work? When someone takes a new photo or modifies an existing photo, the OS notifies the Dropbox mobile app. A background worker we call the scanner carefully identifies all the photos (or videos) that haven't yet been uploaded to Dropbox and queues them for upload. Then another background worker, the uploader, batch uploads all the photos in the queue.

Uploading is a two-step process. First, like many Dropbox systems, we break the file into 4 MB blocks, compute the hash of each block, and upload each block to the server. Once all the file blocks are uploaded, we make a final commit request to the server with a list of all block hashes in the file. This creates a new file consisting of those blocks in the user's Camera Uploads folder. Photos and videos uploaded to this folder can then be accessed from any linked device.
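The block-and-commit flow above can be sketched in Kotlin. This is a hypothetical illustration, not Dropbox's actual code: the function name is invented, and SHA-256 stands in for whatever block hash the real protocol uses.

```kotlin
import java.io.File
import java.security.MessageDigest

// Block size used for uploads, per the post.
const val BLOCK_SIZE = 4 * 1024 * 1024

// Hypothetical sketch: split a file into 4 MB blocks and hash each one.
// In the real flow, each block is uploaded as it is hashed, and the
// resulting hash list is sent in a single final commit request.
fun blockHashes(file: File): List<String> =
    file.inputStream().use { input ->
        val buffer = ByteArray(BLOCK_SIZE)
        val hashes = mutableListOf<String>()
        while (true) {
            val read = input.readNBytes(buffer, 0, BLOCK_SIZE)
            if (read <= 0) break
            val digest = MessageDigest.getInstance("SHA-256")
                .digest(buffer.copyOf(read))
            hashes += digest.joinToString("") { "%02x".format(it) }
        }
        hashes
    }
```

A 5 MB file, for example, would produce two blocks (4 MB + 1 MB) and therefore two hashes in the commit request.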

One of our biggest challenges is that Android places strong constraints on how often apps can run in the background and what capabilities they have. For example, App Standby limits our background network access if the Dropbox app hasn't recently been foregrounded. This means we might only be allowed to access the network for a 10-minute interval once every 24 hours. These restrictions have grown more strict in recent versions of Android, and the cross-platform C++ version of camera uploads was not well equipped to handle them. It would sometimes try to perform uploads that were doomed to fail because of a lack of network access, or fail to restart uploads during the system-provided window when network access became available.

Our rewrite does not escape these background restrictions; they still apply unless the user chooses to disable them in Android's system settings. However, we reduce delays as much as possible by taking maximum advantage of the network access we do receive. We use WorkManager to handle these background constraints for us, guaranteeing that uploads are attempted if, and only if, network access becomes available. Unlike our C++ implementation, we also do as much work as possible while offline (for example, by performing rudimentary checks on new photos for duplicates) before asking WorkManager to schedule us for network access.
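Scheduling constrained background work with WorkManager looks roughly like this. A minimal sketch under stated assumptions: `UploadWorker` and the unique work name `"camera-uploads"` are illustrative, not Dropbox's actual identifiers.

```kotlin
import android.content.Context
import androidx.work.Constraints
import androidx.work.ExistingWorkPolicy
import androidx.work.NetworkType
import androidx.work.OneTimeWorkRequestBuilder
import androidx.work.WorkManager

// Hypothetical scheduler: WorkManager guarantees the worker runs only
// when the constraints are satisfied, i.e. when network access is granted.
fun scheduleUploads(context: Context) {
    val constraints = Constraints.Builder()
        .setRequiredNetworkType(NetworkType.CONNECTED)
        .build()

    val request = OneTimeWorkRequestBuilder<UploadWorker>()
        .setConstraints(constraints)
        .build()

    // KEEP leaves an already-enqueued upload job in place instead of
    // restarting it, so there is at most one pending upload pass.
    WorkManager.getInstance(context)
        .enqueueUniqueWork("camera-uploads", ExistingWorkPolicy.KEEP, request)
}
```

The offline pre-work the post describes (such as duplicate checks) would happen before `scheduleUploads` is called, so the granted network window is spent entirely on uploading.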

Measuring interactions with our status banners helps us identify emerging issues in our apps, and is a helpful signal in our efforts to eliminate errors. After the rewrite was released, we saw users interacting with more "all done" statuses than usual, while the number of "waiting" or error status interactions went down. (This data reflects only paid users, but non-paying users show similar results.)

To further optimize use of our limited network access, we also refined our handling of failed uploads. C++ camera uploads aggressively retried failed uploads an unlimited number of times. In the rewrite we added backoff intervals between retry attempts, and also tuned our retry behavior for different error categories. If an error is likely to be transient, we retry multiple times. If it's likely to be permanent, we don't bother retrying at all. As a result, we make fewer overall retry attempts, which limits network and battery usage, and users see fewer errors.
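A retry policy along these lines can be sketched as follows. The error taxonomy, the status-code mapping, and the backoff constants are all illustrative assumptions, not Dropbox's actual tuning.

```kotlin
import kotlin.math.min
import kotlin.math.pow

// Hypothetical error taxonomy for upload failures.
enum class ErrorKind { TRANSIENT, PERMANENT }

// Server hiccups, timeouts, and throttling are worth retrying;
// most other failures (e.g. 4xx validation errors) are permanent.
fun classify(httpStatus: Int): ErrorKind = when (httpStatus) {
    in 500..599, 408, 429 -> ErrorKind.TRANSIENT
    else -> ErrorKind.PERMANENT
}

// Exponential backoff between attempts: 30s, 60s, 120s, ... capped at 30 min.
fun backoffSeconds(attempt: Int): Long =
    min(30.0 * 2.0.pow(attempt), 1800.0).toLong()

// Retry only transient errors, and only a bounded number of times.
fun shouldRetry(httpStatus: Int, attempt: Int, maxAttempts: Int = 5): Boolean =
    classify(httpStatus) == ErrorKind.TRANSIENT && attempt < maxAttempts
```

Giving up immediately on permanent errors is what cuts the total number of retry attempts, while the backoff spreads the remaining ones across the limited network windows.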

Designing for performance

Our users don't just expect camera uploads to work reliably. They also expect their photos to upload quickly, and without wasting system resources. We were able to make some big improvements here. For example, first-time uploads of large photo libraries now finish up to four times faster. There are a few ways our new implementation achieves this.

Parallel uploads
First, we substantially improved performance by adding support for parallel uploads. The C++ version uploaded only one file at a time. Early in the rewrite, we collaborated with our iOS and backend infrastructure colleagues to design an updated commit endpoint with support for parallel uploads.

Once the server constraint was gone, Kotlin coroutines made it easy to run uploads concurrently. Although Kotlin Flows are typically processed sequentially, the available operators are flexible enough to serve as building blocks for powerful custom operators that support concurrent processing. These operators can be chained declaratively to produce code that's much simpler, and has less overhead, than the manual thread management that would've been necessary in C++.

    val uploadResults = mediaUploadStore
        .getPendingUploads()
        .unorderedConcurrentMap(concurrentUploadCount) {
            mediaUploader.upload(it)
        }
        .takeUntil {
            it != UploadTaskResult.SUCCESS
        }
        .toList()

A simple example of a concurrent upload pipeline. unorderedConcurrentMap is a custom operator that combines the built-in flatMapMerge and transform operators.
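The post doesn't show the operator's definition, but one plausible way to build it from the built-in operators, assuming kotlinx.coroutines, is:

```kotlin
import kotlinx.coroutines.ExperimentalCoroutinesApi
import kotlinx.coroutines.FlowPreview
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flatMapMerge
import kotlinx.coroutines.flow.flow

// Sketch of a custom operator like the one described above: run
// `transform` on up to `concurrency` elements at once, emitting results
// as they complete rather than in source order.
@OptIn(ExperimentalCoroutinesApi::class, FlowPreview::class)
fun <T, R> Flow<T>.unorderedConcurrentMap(
    concurrency: Int,
    transform: suspend (T) -> R,
): Flow<R> = flatMapMerge(concurrency) { value ->
    flow { emit(transform(value)) }
}
```

Because `flatMapMerge` collects its inner flows concurrently, results arrive in completion order, which is exactly what an upload pipeline wants: a slow video should not block the photos queued behind it.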

Optimizing memory use
After adding support for parallel uploads, we saw a big uptick in out-of-memory crashes from our early testers. A number of improvements were required to make parallel uploads stable enough for production.

First, we modified our uploader to dynamically vary the number of simultaneous uploads based on the amount of available system memory. This way, devices with lots of memory could enjoy the fastest possible uploads, while older devices would not be overwhelmed. However, we were still seeing much higher memory usage than we expected, so we used the memory profiler to take a closer look.
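One way to pick a concurrency level from available memory is shown below. The thresholds and upload counts are illustrative guesses, not Dropbox's actual values.

```kotlin
// Hypothetical heuristic: scale the number of parallel uploads with the
// JVM heap headroom, so low-memory devices stay conservative.
fun concurrentUploadCount(runtime: Runtime = Runtime.getRuntime()): Int {
    val usedBytes = runtime.totalMemory() - runtime.freeMemory()
    val availableMb = (runtime.maxMemory() - usedBytes) / (1024 * 1024)
    return when {
        availableMb > 512 -> 4  // plenty of headroom: fastest uploads
        availableMb > 128 -> 2
        else -> 1               // older device: one upload at a time
    }
}
```

Re-evaluating this value between batches (rather than once at startup) would let the uploader back off if memory pressure rises mid-session.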

The first thing we noticed was that memory consumption wasn't returning to its pre-upload baseline after all uploads were done. It turned out this was due to an unfortunate behavior of the Java NIO API. It created an in-memory cache on every thread where we read a file, and once created, the cache could never be destroyed. Since we read files with the threadpool-backed IO dispatcher, we typically ended up with many of these caches, one for each dispatcher thread we used. We resolved this by switching to direct byte buffers, which don't allocate this cache.

The next thing we noticed were large spikes in memory usage when uploading, especially with larger files. During each upload, we read the file in blocks, copying each block into a ByteArray for further processing. We never created a new byte array until the previous one had gone out of scope, so we expected only one to be in memory at a time. However, it turned out that when we allocated a large number of byte arrays in a short time, the garbage collector could not free them quickly enough, causing a transient memory spike. We resolved this issue by re-using the same buffer for all block reads.
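The buffer-reuse fix can be sketched like this. Function and constant names are illustrative; the key point is that the `ByteArray` is allocated once per file, not once per block.

```kotlin
import java.io.File

const val UPLOAD_BLOCK_SIZE = 4 * 1024 * 1024

// Read a file block by block, reusing one buffer for every read instead
// of allocating a fresh ByteArray per block. The callback receives the
// buffer plus the number of valid bytes in it for this block.
fun processBlocks(
    file: File,
    blockSize: Int = UPLOAD_BLOCK_SIZE,
    processBlock: (buffer: ByteArray, length: Int) -> Unit,
) {
    val buffer = ByteArray(blockSize)  // allocated once, reused per block
    file.inputStream().use { input ->
        while (true) {
            val read = input.readNBytes(buffer, 0, buffer.size)
            if (read <= 0) break
            processBlock(buffer, read)  // only the first `read` bytes are valid
        }
    }
}
```

The trade-off is that `processBlock` must finish with the buffer (for example, by hashing and uploading it) before the next read overwrites it, which fits a sequential per-file pipeline.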

Parallel scanning and uploading
In the C++ implementation of camera uploads, uploading could not begin until we finished scanning a user's photo library for changes. To avoid upload delays, each scan only looked at changes that were newer than what was seen in the previous scan.

This approach had downsides. There were some edge cases where photos with misleading timestamps could be skipped completely. If we ever missed photos due to a bug or OS change, shipping a fix wasn't enough to recover; we also had to clear affected users' saved scan timestamps to force a full re-scan. Plus, when camera uploads was first enabled, we still had to check everything before uploading anything. This wasn't a great first impression for new users.

In the rewrite, we ensured correctness by re-scanning the whole library after every change. We also parallelized uploading and scanning, so new photos can start uploading while we're still scanning older ones. This means that although re-scanning can take longer, the uploads themselves still start and finish promptly.

Validation

A rewrite of this magnitude is risky to ship. It has dangerous failure modes that might only show up at scale, such as corrupting one out of every million uploads. Plus, as with most rewrites, we could not avoid introducing new bugs, because we did not understand (or even know about) every edge case handled by the old system. We were reminded of this at the start of the project when we tried to remove some ancient camera uploads code that we thought was dead, and instead ended up DDOSing Dropbox's crash reporting service. 🙃

Hash validation in production
During early development, we validated many low-level components by running them in production alongside their C++ counterparts and comparing the outputs. This let us confirm that the new components were working correctly before we started relying on their results.

One of those components was a Kotlin implementation of the hashing algorithms that we use to identify photos. Because these hashes are used for de-duplication, unexpected things could happen if the hashes change for even a tiny percentage of photos. For instance, we might re-upload old photos believing they are new. When we ran our Kotlin code alongside the C++ implementation, both implementations almost always returned matching hashes, but they differed about 0.005% of the time. Which implementation was wrong?

To answer this, we added some additional logging. In cases where Kotlin and C++ disagreed, we checked whether the server subsequently rejected the upload because of a hash mismatch, and if so, what hash it was expecting. We saw that the server was expecting the Kotlin hashes, giving us high confidence the C++ hashes were incorrect. This was great news, since it meant we had fixed a rare issue we didn't even know we had.

Validating state transitions
Camera uploads uses a database to track each photo's upload state. Typically, the scanner adds photos in state NEW and then moves them to PENDING (or DONE if they don't need to be uploaded). The uploader tries to upload PENDING photos and then moves them to DONE or ERROR.

Since we parallelize so much work, it's normal for multiple parts of the system to read and write this state database simultaneously. Individual reads and writes are guaranteed to happen sequentially, but we're still vulnerable to subtle bugs where multiple workers try to change the state in redundant or contradictory ways. Since unit tests only cover single components in isolation, they won't catch these bugs. Even an integration test might miss rare race conditions.

In the rewritten version of camera uploads, we guard against this by validating every state update against a set of allowed state transitions. For instance, we stipulate that a photo can never move from ERROR to DONE without passing back through PENDING. Unexpected state transitions could indicate a serious issue, so if we see one, we stop uploading and report an exception.
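A guard like this can be sketched as a transition table. The states mirror the post; the exact table and exception type are illustrative assumptions.

```kotlin
// States from the post: scanner moves photos NEW → PENDING (or DONE),
// the uploader moves PENDING → DONE or ERROR.
enum class UploadState { NEW, PENDING, DONE, ERROR }

// Hypothetical allow-list of legal transitions. Note that ERROR → DONE
// is absent: an errored photo must pass back through PENDING.
val allowedTransitions = mapOf(
    UploadState.NEW to setOf(UploadState.PENDING, UploadState.DONE),
    UploadState.PENDING to setOf(UploadState.DONE, UploadState.ERROR),
    UploadState.ERROR to setOf(UploadState.PENDING),
    UploadState.DONE to emptySet(),
)

class IllegalTransitionException(from: UploadState, to: UploadState) :
    IllegalStateException("Unexpected state transition: $from -> $to")

// Called before every state update; an illegal transition halts
// uploading and gets reported as an exception.
fun validateTransition(from: UploadState, to: UploadState) {
    if (to !in allowedTransitions.getValue(from)) {
        throw IllegalTransitionException(from, to)
    }
}
```

Under this table, the DONE to DONE transition described in the next paragraph is illegal, which is exactly how the duplicate-upload bug surfaced in the logs.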

These checks helped us find a nasty bug early in our rollout. We started to see a high volume of exceptions in our logs that were caused when camera uploads tried to transition photos from DONE to DONE. This made us realize we were uploading some photos multiple times! The root cause was a surprising behavior in WorkManager where unique workers can restart before the previous instance is fully cancelled. No duplicate files were being created because the server rejects them, but the redundant uploads were wasting bandwidth and time. Once we fixed the issue, upload throughput dramatically improved.

Rolling it out

Even after all this validation, we still had to be cautious during the rollout. The fully-integrated system was more complex than its parts, and we'd also need to contend with a long tail of rare device types that are not represented in our internal user testing pool. We also needed to continue to meet or surpass the high expectations of all the users who rely on camera uploads.

To reduce this risk preemptively, we made sure to support rollbacks from the new version to the C++ version. For instance, we ensured that all user preference changes made in the new version would apply to the old version as well. In the end we never needed to roll back, but it was still worth the effort to have the option available in case of disaster.

We started our rollout with an opt-in pool of beta (Play Store early access) users who receive a new version of the Dropbox Android app every week. This pool of users was large enough to surface rare errors and collect key performance metrics such as upload success rate. We monitored these key metrics in this population for a number of months to gain confidence it was ready to ship widely. We discovered many problems during this time period, but the fast beta release cadence allowed us to iterate and fix them quickly.

We also monitored many metrics that could hint at future issues. To make sure our uploader wasn't falling behind over time, we watched for signs of ever-growing backlogs of photos waiting to upload. We tracked retry success rates by error type, and used this to fine-tune our retry algorithm. Last but not least, we also paid close attention to feedback and support tickets we received from users, which helped surface bugs that our metrics had missed.

When we finally released the new version of camera uploads to all users, it was clear our months spent in beta had paid off. Our metrics held steady through the rollout and we had no major surprises, with improved reliability and low error rates right out of the gate. In fact, we ended up finishing the rollout ahead of schedule. Since we'd front-loaded so much quality improvement work into the beta period (with its weekly releases), we didn't have any multi-week delays waiting for critical bug fixes to roll out in the stable releases.

So, was it worth it?

Rewriting a big legacy feature isn't always the right decision. Rewrites are extremely time-consuming (the Android version alone took two people working for two full years) and can easily cause major regressions or outages. In order to be worthwhile, a rewrite needs to deliver tangible value by improving the user experience, saving engineering time and effort in the long term, or both.

What advice do we have for others who are beginning a project like this?

  • Define your goals and how you will measure them. At the start, this is important to make sure that the benefits will justify the effort. At the end, it will help you determine whether you got the results you wanted. Some goals (for example, future resilience against OS changes) may not be quantifiable, and that's OK, but it's good to spell out which ones are and aren't.
  • De-risk it. Identify the components (or system-wide interactions) that would cause the biggest problems if they failed, and guard against those failures from the very start. Build critical components first, and try to test them in production without waiting for the whole system to be finished. It's also worth doing extra work up front in order to be able to roll back if something goes wrong.
  • Don't rush. Shipping a rewrite is arguably riskier than shipping a new feature, since your audience is already relying on things to work as expected. Start by releasing to an audience that's just large enough to give you the data you need to evaluate success. Then, watch and wait (and fix stuff) until your data gives you the confidence to continue. Dealing with problems when the user base is small is much faster and less stressful in the long run.
  • Limit your scope. When doing a rewrite, it's tempting to tackle new feature requests, UI cleanup, and other backlog work at the same time. Consider whether this will actually be faster or easier than shipping the rewrite first and fast-following with the rest. During this rewrite we addressed issues linked to the core architecture (such as crashes intrinsic to the underlying data model) and deferred all other improvements. If you change the feature too much, not only does it take longer to implement, but it's also harder to notice regressions or roll back.

In this case, we feel good about the decision to rewrite. We were able to improve reliability right away, and more importantly, we set ourselves up to stay reliable in the future. As the iOS and Android operating systems continue to evolve in separate directions, it was only a matter of time before the C++ library broke badly enough to require fundamental systemic changes. Now that the rewrite is complete, we're able to build and iterate on camera uploads much faster, and offer a better experience for our users, too.

Also: We're hiring!

Are you a mobile engineer who wants to make software that's reliable and maintainable for the long haul? If so, we'd love to have you at Dropbox! Visit our jobs page to see current openings.


Source: https://dropbox.tech/mobile/making-camera-uploads-for-android-faster-and-more-reliable
