INESC TEC researchers developed LazyFS, a tool capable of injecting faults and reproducing data loss bugs. The solution helps not only to understand the origin and cause of these bugs, but also to validate protection mechanisms against failures.
The dictionary tells us that the term “bug”, used to describe software errors, also means “insect”: a small creature that, depending on its constitution or number, can cause significant disruption to human life. In the digital context, the meaning does not differ much from the original. Errors in software can have consequences ranging from abnormal functioning to complete inoperability. Depending on how relevant the affected programs are to users worldwide, the disruption can reach significant levels and lead to major losses.
It is not surprising that science is committed to finding ways to identify and eliminate such errors or, at the very least, to reduce their consequences. “When Amnesia Strikes: Understanding and Reproducing Data Loss Bugs with Fault Injection”, by Maria Ramos, João Azevedo, José Pereira, Tânia Esteves, Ricardo Macedo, and João Paulo, researchers at INESC TEC, focuses precisely on these matters.
Throughout the research, the authors studied and reproduced bugs leading to information loss, especially in data-centric applications such as databases and storage systems, with the aim of automating and simplifying these processes. They also tried to understand the cause of the bugs, relying on information about the operations carried out by the systems and identifying vulnerable data. Finally, the researchers validated protection mechanisms against this type of bug.
According to Maria Ramos, “the problems that development teams face when trying to reproduce and debug data durability issues (a process that consists of finding and correcting unwanted errors) represented one of the motivations for this work. Users often provide ambiguous reports without clear reproduction steps or with complex, manual actions”. The difficulty of reproducing these bugs often leads users to suggest modifying the source code of applications or abruptly cutting the computer’s power supply in very specific time windows.
The solution proposed in the paper, which also involved Jepsen’s Kyle Kingsbury, goes by the name of LazyFS: a software fault injection tool that helps reproduce data loss bugs and test resilience mechanisms. “It is a file system in user space that features its own cache, making it possible to inject faults of total or partial data loss and to imitate the behaviour of the operating system (which is at the origin of these bugs)”, explained Maria Ramos. LazyFS can also be combined with bug exploration tools such as Jepsen (a tool that tests the resilience of distributed systems and has already found several bugs in widely used systems).
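To make the idea of a cache with injectable data loss more concrete, here is a minimal sketch in Python. It is a toy model written for this article, not LazyFS code and not its interface: the class `ToyLazyStore` and its methods are hypothetical names, and the model assumes append-only writes. It only illustrates why data that was written but never synchronised can vanish, in whole or in part, when a fault occurs.

```python
class ToyLazyStore:
    """Toy model of a store with its own write cache (hypothetical, not LazyFS's API)."""

    def __init__(self):
        self.disk = {}     # bytes that were persisted and survive a fault
        self.pending = {}  # appended bytes held only in the cache

    def write(self, path, data):
        # An append is acknowledged as soon as it reaches the cache.
        self.pending[path] = self.pending.get(path, b"") + data

    def read(self, path):
        # Before a fault, readers see persisted plus cached data.
        return self.disk.get(path, b"") + self.pending.get(path, b"")

    def fsync(self, path):
        # Persist everything cached for this path.
        self.disk[path] = self.disk.get(path, b"") + self.pending.pop(path, b"")

    def lose_all_unsynced(self):
        # Total loss: every cached byte disappears.
        self.pending.clear()

    def lose_unsynced_tail(self, path, keep):
        # Partial loss: only the first `keep` cached bytes of `path`
        # reach the disk; the remainder of the cache disappears.
        torn = self.pending.pop(path, b"")[:keep]
        self.disk[path] = self.disk.get(path, b"") + torn
        self.pending.clear()


# Total loss: the second record was never synchronised, so it is gone.
store = ToyLazyStore()
store.write("wal", b"commit-1;")
store.fsync("wal")
store.write("wal", b"commit-2;")
store.lose_all_unsynced()
after_total_loss = store.read("wal")

# Partial loss: only a fragment of the second record survives.
torn_store = ToyLazyStore()
torn_store.write("wal", b"commit-1;")
torn_store.fsync("wal")
torn_store.write("wal", b"commit-2;")
torn_store.lose_unsynced_tail("wal", keep=4)
after_partial_loss = torn_store.read("wal")
```

In this sketch, an application that reports "commit-2" as durable without first calling `fsync` would lose it after the fault, which is the kind of behaviour a fault injection tool is meant to expose.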
Another relevant example of the tool’s usefulness, showcased at the 2024 edition of the International Conference on Very Large Data Bases (VLDB), an A*-ranked venue, is the set of bugs that LazyFS identified independently in etcd (a key-value database), which were promptly reported and confirmed. This fuelled the etcd team’s interest in integrating the tool produced by INESC TEC researchers into their own tests. The same happened with two other database systems, PostgreSQL and MongoDB.
The researchers mentioned in this news piece are associated with INESC TEC and UMinho.