Software Debugging Challenges and Modern Debugger Architecture Design
Challenges of Software Debugging in Real Environments
In modern software development and operations, debuggers face numerous challenges as core tools for locating and solving problems:
- Multi-platform Compatibility: Applications need to run on different operating systems (such as Linux, macOS, Windows) and various hardware architectures (such as amd64, arm64), requiring debuggers to have good cross-platform capabilities.
- Remote and Distributed Debugging: With the popularity of cloud-native and microservice architectures, debug target processes often run on remote hosts, containers, or sandbox environments, making traditional local debugging methods inadequate.
- Security and Isolation: Production environments have strict security isolation requirements for debugging operations. Debuggers need to support the principle of least privilege to avoid impacting business systems.
- High Performance and Low Intrusiveness: Debuggers need to minimize performance impact on the debugged process (tracee), especially in high-concurrency, low-latency scenarios.
- Rich Debugging Features: Including breakpoints, single-stepping, variable inspection, memory examination, thread/goroutine switching, call stack analysis, and so on, potentially across multiple languages and runtimes (as gdb supports debugging multiple languages).
- Artifact and Source Code Management: When debugging a symptom in a production binary, how do we quickly determine which source version it was built from, and how do we reconcile the source paths available during remote debugging with the paths recorded at build time?
- Intermittent Bugs and Deterministic Debugging: A problem that can be reproduced reliably is already half solved, but many real-world problems are hard to reproduce. Reproducing them and debugging them deterministically is a major challenge.
- Other Challenges: Readers surely have pain points of their own that aren't listed here; such is the real, complex world of computing.
To better address these challenges, debugger architectures have evolved and upgraded. Let's learn about the overall architecture of modern debuggers like gdb and dlv.
Note: The tinydbg design and implementation in this chapter is based on go-delve/delve@v1.24.1. Thanks to derekparker, aarzilli, and the many other contributors for their selfless dedication; without their open-source spirit, my curiosity could not have been satisfied to this extent, and this chapter would not exist. To prevent building and installing the trimmed-down dlv from overwriting a reader's existing go-delve/delve installation, the author deliberately changed the module name of the forked, trimmed repository from module github.com/go-delve/delve to module github.com/hitzhangjie/tinydbg. We won't keep repeating that tinydbg originates from go-delve/delve in what follows, but if you compare the two later, you'll find that tinydbg basically retains the original code structure, merely trimmed down to linux/amd64 and stripped of content unrelated to teaching.
Modern Debugger Architecture Design
To better solve the various challenges faced by software debugging in reality, modern debuggers generally adopt a frontend-backend separation architecture, supporting extensibility in the UI layer, service layer, symbol layer, and target layer, as shown in the figure below:
- Frontend-Backend Decoupling: Separating the UI/interaction layer (frontend) from core debugging logic (backend), with communication through standard protocols. This allows flexible adaptation to different frontends (CLI, GUI, Web, IDE plugins, etc.) and facilitates independent evolution and extension of the backend.
- Multiple Communication Modes: Supporting local mode (such as in-process communication implemented with a pipe) and remote mode (such as network communication based on JSON-RPC or DAP), meeting both local and remote debugging needs.
- Cross-platform Support: The backend's core debugging capabilities adapt to multiple operating systems (Windows/Linux/macOS), hardware architectures (amd64/arm64/powerpc), executable file formats (ELF/Mach-O/PE), and debugging information formats (DWARF/Stabs/COFF) through interface abstraction and conditional compilation.
- Security and Isolation: Remote debugging can ensure production environment security through permission control and authentication mechanisms. For example, only allowing the backend to have partial PTRACE operation permissions instead of root permissions, and implementing user authentication and authorization for frontend debugging.
- High Performance Optimization: Streamlining core functionality, reducing dependencies, and minimizing intrusiveness to the debugged process. For example, in production environments, only eBPF-based tracing (such as collecting function latency statistics) might be allowed, with breakpoint operations disallowed.
When I was designing and implementing the content processing and scheduling system on the PCG content platform, the systems left behind by my predecessors had already driven me to despair. At that moment I deeply understood the saying: "One of the core goals of software architecture design is to make invisible things visible." A sound architecture clearly delimits the functional boundaries of its subsystems, which communicate through agreed-upon protocols; the system's capabilities are reflected in how the subsystems are divided and in the protocols between them, rather than being clumsily kneaded together like a sesame ball - who knows how many sesame seeds are inside?
Modern debuggers have basically evolved to the above architecture, such as gdb and dlv. Thanks to this architectural design, modern debuggers have excellent flexibility and adaptability, basically solving the various difficulties we mentioned earlier.
Frontend-Backend Separation Architecture
The functionality of a debugger mainly consists of three parts: UI layer interaction with users, symbol layer parsing, and target layer control of the debugged process. Everyone may be familiar with local debugging, such as using gdb, lldb, dlv, or IDE's built-in debugger to debug local programs. In local debugging scenarios, UI interaction, symbol parsing, and process control can all be completed in the same debugger process. Under what circumstances do we have to split it into frontend and backend debugger instances?
Security Policy Restrictions on Developer Machine Access
In some enterprises, due to security policy requirements, developers may not be able to directly log into test environment or production environment servers. In this case, if debugging programs running in these environments is needed, traditional local debugging methods cannot meet the requirements. There are several main problems:
- Access Restrictions: Developers don't have login permissions for the server and cannot directly start the debugger on the server
- Permission Isolation: Even if limited access is obtained through jump servers, necessary debugging permissions (such as ptrace permissions) may be lacking
- Security Audit: Enterprises need strict auditing of debugging operations, recording who debugged which processes at what time
The frontend-backend separation debugger architecture provides a solution to these problems:
- The backend debugger can be started by operations personnel or automated systems on the target server, only opening necessary debugging ports
- The frontend debugger runs on the developer's local machine, communicating with the backend through network protocols
- Security mechanisms such as authentication, authorization, and auditing can be added at the communication level
- Debugging operations can be managed and controlled through a unified operations platform
This architecture both ensures enterprise security requirements and provides developers with remote debugging capabilities.
No Source Code on the Debugged Process Host
In real-world scenarios there is a good chance of running into this kind of debugging problem, though it naturally depends on the enterprise and the project:
- Some work can be tested in the developer's personal development environment, so missing source code is not an issue;
- Some work must be tested in a shared test environment. The test environment has no source code, but test-environment management is often relatively loose, so developers can upload the source code with rz;
- Production environments are usually managed more strictly. They have no source code, developers update services and configurations through the operations system, and a process can be debugged after isolating its traffic and preserving the scene, but uploading source code with rz is not allowed;
- Even where source code can be uploaded to test or production environments, the version must match the binary, uploading takes time, the uploaded source path may not match the build-time source path, and fixing that mismatch may require root permissions;
- And opening up root permissions means security goes out the window.
- If source code can't be uploaded, can we instead download the binary with sz and debug it locally? You can download it, but the target program might be linux/amd64 or linux/arm64, who knows? And your local machine might be Windows or macOS.
- Even if you download it and your local machine matches the server (or you track down a machine that does), the scene is lost, and without the scene it's hard to pin down flaky problems.
In the end this creates a dilemma for debugging:
- The process to be debugged runs on another machine, Host-2, while my current machine is Host-1;
- Host-2 has no source code;
- Host-2 has the scene; the program downloaded to the local Host-1 may not even be compatible with it, and the scene is lost either way.
Under the frontend-backend separation architecture, this problem becomes easier to solve: debug using the source code on the debugger frontend's host, without uploading any source code to the debugged process's host. The frontend only needs to ask the user for a source path mapping, for example mapping /path-to/main.go to the build-time path /devops/workspace/p-{pipelineid}/src/main.go. See: https://github.com/go-delve/delve/discussions/3017.
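To make the mechanism concrete, here is a minimal sketch of prefix-based path substitution, in the spirit of delve's substitute-path configuration; the rule type and function names below are illustrative, not delve's actual API:

```go
package source

import "strings"

// SubstituteRule rewrites a build-time path prefix to a frontend-local prefix.
// This is a minimal sketch in the spirit of delve's substitute-path
// configuration; the type and function here are illustrative, not delve's API.
type SubstituteRule struct {
	From string // path prefix recorded in the debug info at build time
	To   string // where the corresponding sources live on the frontend host
}

// SubstitutePath applies the first matching rule and returns the rewritten path.
func SubstitutePath(path string, rules []SubstituteRule) string {
	for _, r := range rules {
		if strings.HasPrefix(path, r.From) {
			return r.To + strings.TrimPrefix(path, r.From)
		}
	}
	return path
}
```

In practice a debugger needs the mapping in both directions: build-time to local when displaying sources, and local to build-time when resolving user-supplied breakpoint locations.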
CLI/GUI Debugging - Different Strokes for Different Folks
Some developers prefer a consistent cross-platform CLI debugging interface, while others prefer debugging in VSCode, and still others prefer GoLand. It may not just be a matter of preference: different developers have different habits and different development toolchains. If our debugger supported only a CLI interface, or only a GUI interface, it would be unfriendly to many developers and would hurt their debugging efficiency.
Under the frontend-backend separation architecture, this problem is easier to solve. Taking Go program debugging as an example:
- CLI debugging interface, such as dlv frontend can communicate with dlv backend through JSON-RPC to complete target process debugging;
- VSCode's debugging functionality can communicate with the dlv backend through DAP (Debug Adapter Protocol) to complete target process debugging;
Due to the frontend-backend separation architecture, we can independently develop new debugger UIs for more convenient debugging:
- For example, dlv itself offers a CLI debugging interface; thanks to the separation, GUI frontends such as aarzilli/gdlv can be developed on top of the same backend.
Different OS/Architecture Between Current and Target Machines
Earlier we mentioned the problem of no source code on remote machines, and also mentioned a related problem - the situation where the current machine and target machine have different operating systems or hardware architectures. This situation is very common in actual development and debugging:
- Developers use macOS or Windows for development, but need to debug programs running on Linux servers
- Development machines are x86_64 architecture, but need to debug programs running on ARM architecture servers
- In containerized environments, the operating systems and architectures inside and outside containers may be different
These differences bring the following challenges:
- Locally compiled debuggers may not run on target machines
- Debuggers need to understand different platform executable file formats (such as ELF, PE, Mach-O)
- Debugging-related system calls may be completely different on different platforms
- Low-level details like registers and memory layout have differences
The frontend-backend separation architecture provides an elegant solution to these problems:
- Backend debuggers can be compiled and deployed separately for target platforms
- Frontend debuggers only need to focus on user interaction, without worrying about platform differences
- Platform details are shielded through standardized communication protocols
- Different platform programs can be debugged under the same frontend interface
Summary
The several problems discussed earlier - security compliance, source code access, UI preference differences, platform differences - are the main reasons that drive us to adopt the frontend-backend separation architecture. This architectural design elegantly decouples the debugger's user interface logic from underlying platform implementation, thus better addressing these challenges.
Combining the architectural design, we split the debugger into two core components: Frontend and Backend:
- Frontend is responsible for all user interaction-related functions, including receiving user debugging commands, displaying debugging results, managing debugging sessions, etc. It focuses on providing a smooth user experience without needing to care about underlying implementation details;
- Backend is responsible for implementing specific debugging functions on different platforms. Taking Linux/amd64 platform as an example, it needs to parse DWARF debugging information in ELF files, control the debugged process through system calls, and collect necessary runtime information;
- Frontend and Backend interact through standardized communication protocols. Frontend converts user debugging instructions into commands that Backend can understand, Backend executes these commands and returns results, and Frontend formats these results for presentation to users.
This separation architecture not only solves the aforementioned problems but also provides a good foundation for future expansion and improvement. We can independently improve the frontend interface or add new backend platform support without affecting each other.
Communication Modes
The frontend-backend separation architecture of debuggers cannot do without communication between frontend and backend. What should we consider for this communication?
Different Processes: JSON-RPC over network
If frontend and backend run on different machines, there's no choice but to use network communication. The Go standard library provides json-rpc capability, and we can easily implement frontend-backend communication using the Go standard library.
Once frontend-backend JSON-RPC communication is in place, it also covers the case where frontend and backend run on the same host: simply point the network address at the local loopback interface (localhost/lo).
For the same-host case, though, it is worth thinking a bit harder to make the design more user-friendly and more elegant.
ps: Of course, when two debugger instances communicate on the same machine, Unix domain sockets can be chosen in addition to TCP.
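As a minimal sketch of this idea, the following example wires a frontend and a backend together with the standard library's net/rpc and net/rpc/jsonrpc packages; the DebuggerService type and its CreateBreakpoint method are illustrative placeholders, not delve's actual RPC API:

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/rpc"
	"net/rpc/jsonrpc"
)

// BreakpointArgs/BreakpointReply and DebuggerService are illustrative
// placeholders, not delve's actual RPC API.
type BreakpointArgs struct {
	File string
	Line int
}

type BreakpointReply struct {
	ID int
}

type DebuggerService struct{}

// CreateBreakpoint stands in for a real backend operation; a real backend
// would resolve file:line via DWARF and plant a breakpoint instruction.
func (s *DebuggerService) CreateBreakpoint(args BreakpointArgs, reply *BreakpointReply) error {
	reply.ID = 1
	return nil
}

func main() {
	srv := rpc.NewServer()
	if err := srv.Register(new(DebuggerService)); err != nil {
		log.Fatal(err)
	}

	// Backend: listen on a free port and serve each connection with the JSON-RPC codec.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	go func() {
		for {
			conn, err := ln.Accept()
			if err != nil {
				return
			}
			go srv.ServeCodec(jsonrpc.NewServerCodec(conn))
		}
	}()

	// Frontend: dial the backend and issue a debugging request.
	client, err := jsonrpc.Dial("tcp", ln.Addr().String())
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	var reply BreakpointReply
	if err := client.Call("DebuggerService.CreateBreakpoint",
		BreakpointArgs{File: "main.go", Line: 20}, &reply); err != nil {
		log.Fatal(err)
	}
	fmt.Println("breakpoint created, id =", reply.ID)
}
```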
IDE Integration: DAP over network
To integrate debuggers into IDEs, they need to follow the IDE's Debug Adapter Protocol (DAP). DAP is a standardized protocol that defines the communication format and process between IDEs and debuggers.
DAP uses a JSON-based message format, transmitted over a byte stream (typically standard input/output or a TCP connection). Although the format resembles JSON-RPC, DAP defines its own message structure with dedicated fields such as seq and type. It defines a series of standard request and response messages, including:
- Starting/attaching debugging sessions
- Setting/deleting breakpoints
- Single-stepping/continuing execution
- Viewing variables/call stacks
- Expression evaluation
By implementing the DAP protocol, our debugger can seamlessly integrate into DAP-capable IDEs such as VS Code and GoLand, so users can debug in their familiar IDE environment without switching to a dedicated debugger interface. Conversely, once an IDE plugin implements DAP-based debugging for a programming language, the plugin can also switch between different debugger backends; for example, when debugging Go, VSCode could switch from dlv to another DAP-capable debugger such as recent versions of gdb.
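To make the wire format concrete, here is a small Go sketch of a DAP request. The field layout follows the DAP base protocol (every message carries seq and type, requests add command and arguments), and real adapters frame each JSON body with a Content-Length header:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ProtocolMessage and Request mirror the DAP base protocol: every message
// carries a sequence number and a type; requests add a command and arguments.
type ProtocolMessage struct {
	Seq  int    `json:"seq"`
	Type string `json:"type"` // "request", "response" or "event"
}

type Request struct {
	ProtocolMessage
	Command   string          `json:"command"`
	Arguments json.RawMessage `json:"arguments,omitempty"`
}

func main() {
	// A setBreakpoints request asking the adapter to set a breakpoint at line 20.
	args, _ := json.Marshal(map[string]any{
		"source":      map[string]string{"path": "/path-to/main.go"},
		"breakpoints": []map[string]int{{"line": 20}},
	})
	req := Request{
		ProtocolMessage: ProtocolMessage{Seq: 1, Type: "request"},
		Command:         "setBreakpoints",
		Arguments:       args,
	}
	body, _ := json.Marshal(req)
	// DAP frames each JSON body with a Content-Length header.
	fmt.Printf("Content-Length: %d\r\n\r\n%s\n", len(body), body)
}
```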
Same Process: ListenerPipe
If both frontend and backend run on the local machine, it might seem that the UI layer, symbol layer, and target layer alone would suffice: converting the user's debugging actions into target-layer process control would just be the upper layer calling functions encapsulated by the lower layer. But we have already split the debugger into a frontend-backend separation architecture and clearly defined that frontend and backend interact through the service layer. If, just because everything is local, we bypassed the service layer and called the target layer directly, the boundaries between layers would blur, which is inelegant and needlessly confusing.
So can we choose json-rpc and completely use network communication? We can request a port from the operating system and use this port for communication to avoid port conflicts with other debugger instances or other local processes.
This is a feasible solution, but let's consider it more carefully:
- Running the frontend and backend as two debugger instances on the same machine, communicating via JSON-RPC, works, but this multi-process setup using network communication on a single machine is not elegant. On Linux, for example, why not use more efficient parent-child IPC such as pipes, FIFOs, or shared memory?
- Suppose we stick with two debugger instances: the first one started is the parent process and acts as the frontend; it then has to spawn a child process and establish a pipe between them for inter-process communication. This multi-process + pipe approach is common in single-process, single-threaded C/C++ programs, but Go is concurrent by design with goroutines, so can't we achieve the same thing with a single process + pipe? The standard library even provides net.Pipe, which returns exactly such a pipe for goroutines to communicate over.
- With that understood, we can now consider how to design the service-layer communication. The service layer is essentially network plumbing: the frontend calls net.Dial(...) to establish a net.Conn, while the backend calls net.Listen(...) to obtain a listener and accepts newly established net.Conn connections via listener.Accept(...); frontend and backend then talk over their respective net.Conn. JSON-RPC sits on top of Go's networking library, so none of this is a problem. If we want the frontend and backend service layers to communicate over net.Pipe within the same process while still presenting the familiar networking interfaces, we can implement a custom net.Listener, say ListenerPipe, which wraps a net.Pipe internally: the backend obtains a net.Conn through ListenerPipe.Accept (essentially one end of the net.Pipe), while the frontend obtains the associated net.Conn (the other end) and uses it to talk to the backend, as sketched below.
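The following is a minimal, illustrative sketch of such a listener built directly on net.Pipe; it is not delve's actual implementation, but it shows how a backend can Accept a net.Conn and a frontend can dial one without any real network socket:

```go
package service

import (
	"errors"
	"net"
	"sync"
)

// ListenerPipe is a minimal sketch (not delve's actual implementation) of a
// net.Listener backed by net.Pipe, so a frontend and backend living in the
// same process can talk over ordinary net.Conn values.
type ListenerPipe struct {
	conns  chan net.Conn
	closed chan struct{}
	once   sync.Once
}

// ListenPipe returns a listener for the backend and a dial function for the
// frontend; each call to dial produces a new in-process "connection".
func ListenPipe() (*ListenerPipe, func() (net.Conn, error)) {
	lp := &ListenerPipe{conns: make(chan net.Conn), closed: make(chan struct{})}
	dial := func() (net.Conn, error) {
		frontendConn, backendConn := net.Pipe()
		select {
		case lp.conns <- backendConn: // handed to the backend via Accept
			return frontendConn, nil // the frontend keeps the other end
		case <-lp.closed:
			return nil, errors.New("listener closed")
		}
	}
	return lp, dial
}

// Accept blocks until the frontend dials, then returns the backend's end.
func (l *ListenerPipe) Accept() (net.Conn, error) {
	select {
	case c := <-l.conns:
		return c, nil
	case <-l.closed:
		return nil, errors.New("listener closed")
	}
}

func (l *ListenerPipe) Close() error {
	l.once.Do(func() { close(l.closed) })
	return nil
}

// Addr satisfies net.Listener; the address is purely symbolic.
func (l *ListenerPipe) Addr() net.Addr { return pipeAddr{} }

type pipeAddr struct{}

func (pipeAddr) Network() string { return "pipe" }
func (pipeAddr) String() string  { return "pipe" }
```

The frontend calls the returned dial function instead of net.Dial, while the backend's JSON-RPC server consumes connections from Accept exactly as it would from a TCP listener.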
Summary
With this we have addressed service-layer communication for debuggers running both on the same host and across different hosts: across hosts we use JSON-RPC over the network; on the same host we can also use JSON-RPC, or run a single process with the ListenerPipe implementation. For IDE integration, we implement the DAP protocol.
Later we'll discuss how the debugger runtime decides whether to run in frontend-backend single process mode or frontend-backend separation mode.
Platform Extensibility
When discussing the frontend-backend separation architecture, we mentioned that frontend and backend may run on different types of hosts, and these hosts' operating systems and hardware architectures may have significant differences. These differences may cause our debugger to run well on one operating system and hardware platform combination, but crash directly or be unable to run on another combination.
Debugging actions are relatively enumerable, such as setting breakpoints, reading/writing memory, reading/writing registers, single-stepping, etc. We need to convert these into target layer operation sets, and when implementing these target layer operations on different operating systems and hardware platforms, we need to consider the differences between different platforms.
Here we need to make the necessary abstractions over the target-layer operation set, for example extracting a Target interface that covers all operations on the target process; different operating systems and hardware platforms then provide their own implementations of that interface, as sketched below.
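A hypothetical sketch of what such an interface might contain follows; the names are illustrative, not delve's actual types:

```go
// Package target sketches a hypothetical target-layer abstraction; the names
// are illustrative, not delve's actual types. Each OS/arch pair (linux/amd64,
// darwin/arm64, ...) would provide its own implementation, for example a
// ptrace-based one on Linux.
package target

// Registers exposes the registers a symbol-level debugger cares about most.
type Registers interface {
	PC() uint64
	SP() uint64
}

// Target is the operation set the upper layers program against.
type Target interface {
	SetBreakpoint(addr uint64) error
	ClearBreakpoint(addr uint64) error
	ReadMemory(buf []byte, addr uint64) (int, error)
	WriteMemory(addr uint64, data []byte) (int, error)
	Registers(threadID int) (Registers, error)
	StepInstruction(threadID int) error
	Continue() error
}
```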
Debug Object Extensibility
What we want to debug may be a running process, or it may be a core file (also commonly called coredump file) generated by a process that has already died.
For a running process, as long as it is still alive you can obtain almost all of its state through the operating system. Once it has crashed, however, you cannot recover the full runtime state from the core file it left behind: a core file is only a snapshot, typically the memory image and per-thread register state at the moment of the crash, from which the call stacks can be reconstructed to help developers understand where the program finally hit a fatal, unrecoverable error.
We mentioned earlier that the Target interface is for controlling the target process, which of course involves reading and writing process state. Here we need to account for the differences between reading/writing the state of a live process and of a core file, and consider extracting a Process interface, with live processes and core files providing their own implementations, as sketched below.
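Extending the hypothetical target package sketched above, such an interface might look like this; a live process and a core dump can both satisfy it, with the core-file implementation rejecting anything that needs a running tracee:

```go
package target

// Process is an illustrative sketch: both a live process and a core dump can
// implement it; the core-file implementation simply rejects operations that
// require a running tracee, such as resuming execution or writing memory.
type Process interface {
	Pid() int
	ReadMemory(buf []byte, addr uint64) (int, error)
	ThreadIDs() []int
	// Resume returns an error for core files: there is nothing to resume.
	Resume() error
}
```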
File Extensibility
The formats of executable files and core files differ across operating systems: Linux mostly uses ELF, Darwin mostly uses Mach-O, and Windows uses PE; for core files, Linux uses ELF, Windows uses its minidump format, and Darwin writes Mach-O core files.
These file format differences inevitably lead to certain differences when reading files and reading debugging information, for example:
- Their file headers are all different;
- The section names that store debugging information may also differ: some are zlib-compressed and placed in .zdebug sections, others are uncompressed and placed in .debug sections;
- The debugging information may not even be placed in the binary itself: Darwin, for example, may put it in a .dSYM bundle alongside the binary;
- They may not even use the DWARF debugging information format at all.
Therefore, an appropriate abstraction over the executable file is needed to shield the symbol layer from these platform differences.
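Conveniently, the Go standard library ships readers for all three container formats (debug/elf, debug/macho, debug/pe), each exposing a DWARF() method, so a minimal format-dispatch sketch might look like this (error handling simplified; separate .dSYM bundles and external debug files are not covered):

```go
package objfile

import (
	"debug/dwarf"
	"debug/elf"
	"debug/macho"
	"debug/pe"
	"fmt"
)

// OpenDWARF is a minimal format-dispatch sketch: the standard library's
// debug/elf, debug/macho and debug/pe packages each expose a DWARF() method,
// so the symbol layer can work with *dwarf.Data regardless of the container.
func OpenDWARF(path string) (*dwarf.Data, error) {
	if f, err := elf.Open(path); err == nil {
		defer f.Close()
		return f.DWARF()
	}
	if f, err := macho.Open(path); err == nil {
		defer f.Close()
		return f.DWARF()
	}
	if f, err := pe.Open(path); err == nil {
		defer f.Close()
		return f.DWARF()
	}
	return nil, fmt.Errorf("unrecognized executable format: %s", path)
}
```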
Debug Information Format Extensibility
Debug information formats may also be different. DWARF is a latecomer, and more and more languages and toolchains are using DWARF as their debugging information format, such as the Go toolchain using DWARF as its debugging information format.
Since this book mainly introduces the design and implementation of a Go symbol-level debugger, and the Go compilation toolchain itself uses DWARF, we don't strictly need to worry about debug-information extensibility. But no one can guarantee that a more expressive, more efficient, less space-hungry debugging information standard won't appear in the future. Even if one doesn't, DWARF itself is an evolving standard: between its widely adopted version 4 and today's version 5 there are real differences, so when we debug binaries carrying different versions of DWARF data we face the same kind of divergence.
For this, we can consider making certain abstractions when loading, reading, and parsing debugging information formats, thus shielding the differences between different versions of DWARF, and even different debugging formats.
DWARF arrived later than its predecessors (Stabs, COFF, PE/COFF, OMF, IEEE-695, and so on), and they can no longer compete with it. Readers interested in these once-famous formats can refer to: Debugging Information Format.
Debugger Backend Extensibility
The frontend-backend separation architecture also gives us extra flexibility. Since our own debugger implementation is split into frontend and backend parts, can other debuggers such as gdb, lldb, or mozilla rr also serve as backends for our debugger frontend? Yes, they can.
Why would we have such a requirement?
- Suppose our own backend lacks some capability. For example, dlv has no ptype command for printing type information, but gdb does; can I connect dlv's frontend to a gdb backend to get ptype?
- Or suppose I want reverse debugging. dlv doesn't provide it, but mozilla rr (record and replay) does; can I connect dlv's frontend to rr to get reverse debugging?
To let our debugger support these different backends (its own native backend, gdb, lldb, rr), we again introduce the necessary abstraction, so that the --backend parameter specified at debug time selects which backend implementation to start.
First, it's important to be clear that the debugging capabilities exposed by our tinydbg debugger frontend apply to all backend implementations (the tinydbg native backend, gdb, lldb, mozilla rr). Since the frontend is only responsible for UI-layer interaction and display, switching backends simply means the frontend tells the backend, via request parameters, which implementation to use; the backend then selects the corresponding implementation based on --backend, such as native (tinydbg), gdb (gdbserial talking to gdbserver), lldb, or rr (gdbserial talking to mozilla rr). A dispatch sketch follows.
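The dispatch itself can be as simple as a switch on the flag value. The sketch below is purely illustrative: the backend types are stubs, and the exact set of accepted values depends on the debugger (dlv, for instance, accepts values such as native, lldb, and rr):

```go
package main

import (
	"errors"
	"fmt"
)

// Backend is a deliberately tiny placeholder; a real backend would implement
// the full target-layer operation set sketched earlier.
type Backend interface {
	Name() string
}

// nativeBackend stands in for a ptrace-based implementation (e.g. linux/amd64).
type nativeBackend struct{}

func (nativeBackend) Name() string { return "native" }

// gdbserialBackend stands in for a backend that speaks the gdb remote serial
// protocol to an external debugger server (gdbserver, debugserver, rr, ...).
type gdbserialBackend struct{ server string }

func (b gdbserialBackend) Name() string { return "gdbserial:" + b.server }

// newBackend selects a backend implementation from the --backend flag value.
func newBackend(kind string) (Backend, error) {
	switch kind {
	case "native":
		return nativeBackend{}, nil
	case "gdb", "lldb", "rr":
		return gdbserialBackend{server: kind}, nil
	default:
		return nil, errors.New("unknown backend: " + kind)
	}
}

func main() {
	b, err := newBackend("rr") // e.g. the value parsed from --backend=rr
	if err != nil {
		panic(err)
	}
	fmt.Println("selected backend:", b.Name())
}
```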
Summary of This Article
This article has examined the frontend-backend separation architecture and the key extensibility issues in debugger design. We've analyzed the extensibility that debuggers need to consider from multiple dimensions, including:
- Abstraction of debugging actions and extensibility of target layer operations
- Extensibility of debug objects (processes and core files)
- Extensibility of executable file formats under different operating systems
- Extensibility of debugging information formats (such as DWARF and its versions)
- Extensibility of debugger backends
Through frontend-backend separation architectural design and reasonable abstraction at various levels, modern debuggers like gdb and dlv have achieved good extensibility. This enables them to adapt to different usage scenarios, including:
- Daily debugging in local development environments
- Debugging in remote server or container environments
- Automated debugging in multi-platform CI/CD processes
- Safe problem diagnosis in production environments
This extensible design not only improves the debugger's adaptability but also lays a good foundation for future feature expansion and optimization. The subsequent content on tinydbg's design and implementation will cover these aspects in turn.
References
- go-delve/delve, https://github.com/go-delve/delve
- gdb, https://sourceware.org/gdb/
- mozilla rr, https://rr-project.org/