At the Heart of a Software Publisher’s Daily Life: Fixing a Performance Bug in Droid with the Help of the Digital Preservation Community

    Droid is an open-source library for format identification (https://www.nationalarchives.gov.uk/information-management/manage-information/policy-process/digital-continuity/file-profiling-tool-droid/), nearly indispensable in the toolkit of any digital preservation software. For this reason, we have integrated it into the Arcsys software.

    When a client contacted us about degraded performance during the archiving process, we did not suspect that it would lead to an epic journey… which would spark lengthy debates among Droid contributors about the mechanisms of format identification and benefit the entire community!

    Searching for the Weak Link

    The client opened a ticket explaining that archiving was taking far longer than expected. The files were very large, up to several terabytes, and they stayed stuck for dozens of hours in what always seemed to be the same phase of the process: format identification.

    Naturally, the first question was: what types of files were involved? The answer was that they were exclusively compressed files in ZIP format.

    Armed with this information, we tackled the essential step that every software developer knows: reproducing the problem.

    Anomalous Bandwidth

    We began testing the archiving of multi-gigabyte ZIP files. We quickly identified the source of the slow processing: it was indeed the format identification step performed by the Droid library. Using a Java profiling tool to measure read bandwidth, we discovered that the slowness was caused by the entire file being read twice!
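    The kind of measurement described above can be sketched without a profiler. This is a minimal illustration, not the tooling we used: `CountingReader`, `read_ratio`, `header_only` and `double_full_scan` are hypothetical names, and the two "identifiers" are stand-ins rather than Droid calls. The idea is simply to compare the number of bytes read with the file size; a ratio near 2 means the whole file was read twice.

```python
import io


class CountingReader:
    """Wraps a binary file object and totals the bytes returned by read()."""

    def __init__(self, raw):
        self._raw = raw
        self.bytes_read = 0

    def read(self, size=-1):
        data = self._raw.read(size)
        self.bytes_read += len(data)
        return data

    def seek(self, offset, whence=io.SEEK_SET):
        return self._raw.seek(offset, whence)


def read_ratio(identify, payload):
    """Run identify(stream) and return bytes read divided by payload size."""
    stream = CountingReader(io.BytesIO(payload))
    identify(stream)
    return stream.bytes_read / len(payload)


def header_only(stream):
    # Hypothetical well-behaved identifier: looks at the first 4 bytes only.
    return stream.read(4)


def double_full_scan(stream):
    # Hypothetical misbehaving identifier: reads the whole stream twice.
    stream.read()
    stream.seek(0)
    stream.read()


# A fake "ZIP-like" payload: the ZIP magic number followed by a megabyte of zeros.
payload = b"PK\x03\x04" + bytes(1_000_000)

print(read_ratio(header_only, payload))       # a tiny fraction of the file
print(read_ratio(double_full_scan, payload))  # 2.0
```

    On real files, the same wrapper can be placed around an opened file instead of an in-memory buffer.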

    To our surprise, we found that when running the same file through Droid’s graphical tool, DroidUI, only a minimal part of the file was read, and the result was produced much faster.

    Opening an Issue on GitHub

    With these concrete findings in hand, we opened an issue on Droid’s GitHub (https://github.com/digital-preservation/droid/issues/906) while continuing our research. We realized that the problem depended on which binary signature file was combined with which container signature file.

    To summarize, Droid identifies a file’s type using a binary signature file, which describes the byte sequences characteristic of each format. The complexity arises from “container” formats, such as ZIP, which have their own container signature file. Both signature files evolve over time with new versions (as contributors add new formats, for example).
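    To make the two levels concrete, here is a simplified sketch. It does not use Droid’s signature file format or API: `SIGNATURES`, `identify_binary` and `container_hint` are hypothetical names, the table holds a few well-known magic numbers, and treating the presence of `word/document.xml` as a marker for DOCX is a deliberate simplification. The first function plays the role of a binary signature (bytes at a known offset); the second plays the role of a container signature (looking at what is inside the ZIP).

```python
import io
import zipfile

# Illustrative magic numbers only: (offset, expected bytes) per format.
SIGNATURES = {
    "ZIP": (0, b"PK\x03\x04"),
    "PNG": (0, b"\x89PNG\r\n\x1a\n"),
    "TIFF (little-endian)": (0, b"II*\x00"),
}


def identify_binary(data):
    """Return the formats whose byte sequence appears at the expected offset."""
    return [
        name
        for name, (offset, magic) in SIGNATURES.items()
        if data[offset:offset + len(magic)] == magic
    ]


def container_hint(data):
    """Look inside a ZIP for an entry that hints at a more specific format."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = set(archive.namelist())
    if "word/document.xml" in names:
        return "DOCX"
    return None


# Build a tiny in-memory ZIP that looks like a Word document from the outside.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")
sample = buffer.getvalue()

print(identify_binary(sample))  # ['ZIP']
print(container_hint(sample))   # DOCX
```

    The binary level only says "this is a ZIP"; the container level is what distinguishes one ZIP-based format from another.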

    In our case, we discovered that the slowness appeared or disappeared depending on which version of the binary signature file was combined with which version of the container signature file. This explains why the DroidUI graphical client didn’t exhibit the anomaly: it used a different combination! Working with the community, we were able to pinpoint the problematic entries, and the anomaly was fixed by “cleaning up” the signature files. Meanwhile, we implemented the fix for our client, who was relieved to see the performance issue immediately resolved.

    A Source of Debate and Reflection

    What is particularly interesting is that this sparked debates and reflections. Andy Jackson, a technical architect at the Digital Preservation Coalition, discussed this episode on his blog (https://anjackson.net/2023/03/21/speeding-up-format-identification/#signatures-going-wild; https://anjackson.net/2023/03/22/my-format-identification-misunderstandings/). Andy reflected on a suggestion by Martin Hoppenheit (https://martin.hoppenheit.info/blog/2017/minimizing-the-droid-signature-file/), proposing to “customize” signature files according to the client’s context to optimize performance. For example, if a client’s archive only contains TIFF and PNG files, why include other formats in the signature files, penalizing performance?

    After reflection, Andy Jackson concluded that this might not be a good idea. Such an approach risks making binary and container signature files incompatible (especially with future updates). He emphasized that it is Droid’s responsibility to avoid combinations that cause a “full scan” of files.
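    Why can a signature trigger a “full scan” at all? The sketch below shows one generic way this can happen. It is not Droid’s matching engine, and we make no claim that it reproduces the exact entries involved in the issue; `match_at_start` and `search_anywhere` are hypothetical helpers. A pattern anchored at the beginning of the file costs a handful of bytes, whereas a pattern allowed to appear anywhere forces the reader to continue until it is found or the file ends.

```python
import io


def match_at_start(stream, magic):
    """Anchored check: reads at most len(magic) bytes. Returns (matched, bytes_read)."""
    head = stream.read(len(magic))
    return head == magic, len(head)


def search_anywhere(stream, pattern, chunk_size=64 * 1024):
    """Unanchored search: reads chunk by chunk until the pattern is found or EOF.

    Returns (matched, bytes_read).
    """
    total = 0
    tail = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return False, total
        total += len(chunk)
        window = tail + chunk
        if pattern in window:
            return True, total
        # Keep the last len(pattern) - 1 bytes so a match spanning two chunks is not missed.
        tail = window[-(len(pattern) - 1):] if len(pattern) > 1 else b""


payload = b"PK\x03\x04" + bytes(1_000_000)

print(match_at_start(io.BytesIO(payload), b"PK\x03\x04"))     # (True, 4)
print(search_anywhere(io.BytesIO(payload), b"NOT-PRESENT"))  # (False, 1000004)
```

    With a one-megabyte test payload the difference is negligible; with files of several terabytes it is the difference between seconds and hours.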

    Key Takeaways

    Here are a few lessons I personally drew from this correction process:

    • If format identification is slow, check the read bandwidth and make sure that files are not being read in their entirety several times during this step. This is rarely normal!
    • More generally, format identification is not “magic” and can be costly, especially if the files are large. It’s important to be aware of this and consider its actual utility in the client project context.
    • I was impressed by the responsiveness of the digital preservation open-source community, which quickly resolved the Droid anomaly.
    • I discovered the passion driving this community, thanks to Andy Jackson’s blog. This aligns with my article on iPres (https://arcsys-software.com/2024/10/15/feedback-from-fripres22/).

    This story had two goals. First, to describe the daily life of a software publisher’s development teams in fixing a performance issue reported by a client, and second, to highlight the power and dedication of open-source communities, particularly in digital preservation.

    P.S.: I would like to thank my colleague Raphaël Lample, who led most of the technical investigation described in this article and guided the community toward the root of the problem.

    Mikaël MECHOULAM