The Limitations of Abstractions in Modern Computing Languages
Abstraction in modern, high-level computing languages can obscure important implementation details, leading to disordered and ineffective system design. Abstraction, leveraged effectively, should enable quick and effective reuse of shared or redundant portions of code; it should not encourage engineers to skip understanding or auditing the implementation details of the components involved in a system’s design. It is proposed here that the theoretically optimal system design language involves the fewest layers of abstraction and the greatest degree of reuse, extensibility and portability across shared code libraries (1). Proposed below are several proof-of-concept tools to remove layers of abstraction from Python and C/C++ source files, an expositional example of a deeply misleading user-defined type name in a popular C/C++ code base (Node.js), and some proposals for further inquiry into abstraction-less (or abstraction-minimal) system language design.
(1) Note that this implies, by extension, that “readability” (as intuitive correlation to natural language) and “understandability” (as minimal time for engineers to audit and understand the system) are not desirable traits of a system design language.
The Theory
Take a look at the example below, taken from a source file of the Django web framework (https://github.com/django/django/blob/main/django/core/management/color.py).
Django vs. Django randomified
Now imagine, if you will, that you just found a bug in your web app related to Django’s handling of color semantics. (I’m not sure that’s what this particular source file is related to, but suppose for the sake of exposition that you arrive at this source file needing to understand whether it is involved in the processing of color in your application.) Take a moment, have a look at the file above, and put yourself in the shoes of an engineer trying to audit a portion of the codebase involving these files. How would you understand what the files were intended to do? How would you get a grasp of their place in the system?
Typically, when you audit a piece of code as an engineer, you’re interested in two things:
1. Whether the source file is involved in the part of the system related to the function you are interested in (for debugging or extension).
2. How the source file accomplishes that function, or relates to other components to accomplish that function.
Note that the word “function” above means a service to a user or other software system, not a tightly defined call point in a source file. Also note that these questions are inextricable: you might need to know #2 to answer #1. This is an interesting insight because, if natural language semantics of the sort supported by modern, high-level computing languages effectively captured the “function” of the system as defined above, you could simply read the natural language semantics and answer #1 immediately. That isn’t the case.
When answering the above questions, abstractions (the nomenclature used to define variable and class names) typically provide only a signpost to the portions of the system likely involved in providing a particular function. In this way they can be useful as a guide through large code bases, where finding the source files relevant to a particular component or function would otherwise be untenable. The mistake engineers tend to make, though, is “stopping at the variable or class name” and progressing with a “best guess” based on that name. Modern programming languages are innately designed to support this.
Note that part of what I’m pointing out here is an element of engineer experience: how do you think about the code as an engineer? And by extension, how do the tools you use guide and encourage you to think about the code? Programming languages are shaped around ideas, and the ideas that form them nudge you on your journey in development. Most modern computing languages completely fail to deliver on the notion of abstraction and instead empower poor code auditing practices.
The best programming language is the language most engineers hate.
Because it would force engineers to either:
Understand the code through auditing.
Understand they don’t understand the code and progress accordingly.
There would be no layers of abstraction to lean on. The reality is that a 30-character CamelCaseClassName will not capture the complexity of what is actually occurring beneath all those layers of abstraction, not even close. Not even the basic structure of a class in a high-level programming language can be accurately represented that way, let alone the complexities of the underlying language structures supporting the higher-level structure. If you don’t believe me, pick any large C/C++ code base (I’ve been playing around with Node.js, https://github.com/nodejs/node), use the “nofu” tool provided below, and “Destruct” a moderately complicated struct with 2–3 nested structs in it. What does it actually look like? Does the variable name capture it?
An Example From Node.js
Here’s an example…
Take the following user-defined type definition from the openssl library (as of 07/07/2023, defined at https://github.com/nodejs/node/blob/main/deps/openssl/openssl/crypto/x509/x509_local.h):
Suppose I’ve reached this portion of the code and I want to understand what’s going on. Intuitively, looking at the name of the type, I can tell that I’m probably accessing some data about an x509 certificate. I even got a bit of extra context from the comment: “This is the functions” (kidding… this is a meaningless statement and not a helpful comment). Immediately, though, I’m left with a ton of other questions:
1. What are “the functions”?
2. Why do I need to look up anything related to the handling of an x509 cert?
3. Where is the actual certificate data stored?
4. How does this structure relate to the storage of the actual certificate data?
The name in this context provides nothing more than a suggestion that this code construct has something to do with an x509 certificate, and that it’s a series of functions and supporting contextual data (which is redundant, since we could’ve deduced this just by looking at the struct itself). But none of 1–4 can be answered by looking at the struct itself, because they require knowing what’s going on, how the struct is used, and what it really is.
Now I have two paths as an engineer working in a modern computing language that supports these types of abstractions:
1. Trust what I was able to glean from what I just audited (“x509_lookup_st is a struct that contains some functions that do a lookup and some contextual data”) and continue auditing using my gestural sense of what this means.
2. Dive into the struct and piece together a picture of what it actually is, to put the remainder of the component I am auditing in context.
Which path an engineer takes depends on their personal inclinations, level of experience, desire for growth and time constraints; however, it is intuitively obvious that some level of #2 is ideal. Of course, it is not possible to build a complete picture of the system all the time, even when we need or would benefit from one. But, as mentioned above, in lieu of this it is better to design a system that accounts for unknowns than to rely on speculative guesses.
Progressing at this point relies on an observation that I have to hold a priori in this text, and that I would encourage any budding computer scientists / software engineers to examine through experimentation for themselves.
Building systems based on intuitive guesses derived from natural language semantics results in incoherent systems. An engineer’s “best guess” at the juncture outlined above is often wrong, and in many cases fails to capture the complexity and necessary details of the underlying components that are imperative to understand while running down bugs and building new functionality “on top of” or in relation to existing components.
Let’s juxtapose a few things.
The name of the user-defined type:
x509_lookup_st
The actual syntactic definition of the user-defined type.
And finally, the “Destructed”, “in memory” view (all nested struct definitions un-nested, and any type definitions that aren’t primitive types or enums removed), a la Nofu (https://github.com/NWc0de/Nofu) (2).
The “in memory” view is 1,400 lines; the screenshot below captures only a quarter of it.
x509_lookup_st represented in primitive memory form
Glancing at https://gist.github.com/NWc0de/ff80e9c1c31e49426af0a1f9f3e6b712, your first impression may be that I’ve simply capriciously mangled the source file. You might think there’s no possible way that what you’re looking at is what x509_lookup_st actually is in memory, but it is. If you don’t believe me, you can manually walk through all of the nested definitions that comprise x509_lookup_st and consider what the struct would look like “unwound”. The representation provided by Nofu (outlined above) is consistent with this exercise.
All Nofu does is take a struct definition and iterate through it line by line. Every time it encounters a user-defined type that is known to be another struct definition, it recurses. Eventually every struct definition nested inside the top-level struct is replaced with a flattened representation of its own type, one that contains no nested structs. The recursion bottoms out if it encounters a loop (a struct which contains a reference to itself).
See https://github.com/NWc0de/Nofu/tree/main/Demo for a simple example.
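To make the recursion described above concrete, here is a minimal sketch in Python. It is not Nofu’s actual implementation (Nofu is a VS Code plugin that works on real C/C++ sources). It assumes a hypothetical, pre-parsed model in which each struct is just a list of (type name, field name) pairs, and the names `destruct` and `EXAMPLE_STRUCTS` are invented for illustration. Pointers, enums and classes are ignored entirely.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified model: struct name -> list of (type name, field name).
# A real tool would have to parse C/C++ declarations to build this.
Structs = Dict[str, List[Tuple[str, str]]]


def destruct(name: str, structs: Structs,
             path: Tuple[str, ...] = (), depth: int = 0) -> List[str]:
    """Return the fields of `name` with every nested struct expanded in place."""
    active = path + (name,)  # structs currently being expanded
    pad = "    " * depth
    lines: List[str] = []
    for type_name, field_name in structs[name]:
        if type_name not in structs:
            # Not a known struct: treat it as a primitive and keep it as-is.
            lines.append(f"{pad}{type_name} {field_name};")
        elif type_name in active:
            # A struct that (directly or indirectly) contains itself: bottom out.
            lines.append(f"{pad}/* cycle: {type_name} */ {field_name};")
        else:
            # Known struct: recurse and inline its fields under the field name.
            lines.append(f"{pad}{{")
            lines.extend(destruct(type_name, structs, active, depth + 1))
            lines.append(f"{pad}}} {field_name};")
    return lines


# Invented example data, not taken from any real code base.
EXAMPLE_STRUCTS: Structs = {
    "outer": [("int", "id"), ("inner", "payload"), ("outer", "next")],
    "inner": [("char", "tag"), ("double", "value")],
}

if __name__ == "__main__":
    print("\n".join(destruct("outer", EXAMPLE_STRUCTS)))
```

Even in this toy form the type names `outer` and `inner` disappear from the body of the output, leaving only primitives, field names and a cycle marker, which is the effect the “in memory” view is after.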
That’s what happened here with x509_lookup_st, and the result is a 1,400-line type representation. Even that is still an oversimplification, as Nofu doesn’t handle enums or other user-defined types like classes (where things will get really untenable).
Now let’s return to the problem at hand. I’m an engineer tasked with fixing a bug in openssl related to the handling of x509 certs, and during auditing I come across not only x509_lookup_st but many structs of this type, and to fix the bug I need to understand the implementation details buried somewhere in the partially cyclically defined structure represented by https://gist.github.com/NWc0de/ff80e9c1c31e49426af0a1f9f3e6b712.
Further, while auditing, all I have to go on is the name of the struct, and two paths I can take:
1. Trust what I was able to glean from the name (“x509_lookup_st is a struct that contains some functions that do a lookup and some contextual data”) and continue auditing using my gestural sense of what this means.
2. Dive into the struct and piece together a picture of what it actually is, to put the remainder of the component I am auditing in context.
Neither of these is conducive to building a complete enough model to make the changes I need to make. But most damaging in this context would be presuming that “a struct that contains some functions that do a lookup and some contextual data” captures the complexity of https://gist.github.com/NWc0de/ff80e9c1c31e49426af0a1f9f3e6b712, not to mention all the important details contained within it that would influence the design or my ability to determine how a defect occurs.
(2) There are a few bugs in Nofu as of 07/07/2023, so some structs are left unparsed and there are a few formatting errors (excessive tabs and poor placement of destructed structs). Nonetheless, the core functionality is there. I will post an updated “clean” view when I get around to polishing up Nofu.
WIP
TODO: go on to illustrate the loss in meaning if progressing #1, provide an example of an alternative language design, discuss implications
Tooling (PoC)
If you find these ideas interesting and want to explore what Python files look like without engineer-developed abstractions, or what structs look like when they are stripped down to their primitive type representation with nested layers pulled into one type representation, you can try out the tools below.
Remove Abstractions from Python Files
A utility to remove natural language semantics imbued in user-defined types from Python source: NWc0de/Blind: Obfuscate your Python code
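For a feel of what stripping names out of Python source involves, here is a minimal sketch built on the standard library `ast` module (it needs Python 3.9+ for `ast.unparse`). It is not how Blind itself is implemented. The names `Blinder`, `collect_user_names` and `blind` are hypothetical, and the sketch deliberately ignores attribute access, keyword arguments in calls, imports and docstrings, so it only preserves behavior for simple code.

```python
import ast


def collect_user_names(tree: ast.AST) -> set:
    """Collect names the source itself defines: functions, classes, args, assigned variables."""
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            names.add(node.name)
        elif isinstance(node, ast.arg):
            names.add(node.arg)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            names.add(node.id)
    return names


class Blinder(ast.NodeTransformer):
    """Replace user-defined names with opaque placeholders (v0, v1, ...)."""

    def __init__(self, user_names):
        # Sorted so that the mapping is deterministic for a given source file.
        self.mapping = {name: f"v{i}" for i, name in enumerate(sorted(user_names))}

    def _rename_definition(self, node):
        node.name = self.mapping.get(node.name, node.name)
        self.generic_visit(node)  # keep renaming inside the body
        return node

    visit_FunctionDef = _rename_definition
    visit_ClassDef = _rename_definition

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_Name(self, node):
        # Builtins and imported names are not in the mapping, so they survive.
        node.id = self.mapping.get(node.id, node.id)
        return node


def blind(source: str) -> str:
    """Return `source` with its user-defined names replaced by placeholders."""
    tree = ast.parse(source)
    renamed = Blinder(collect_user_names(tree)).visit(tree)
    return ast.unparse(renamed)


# Invented example input.
SAMPLE = "def area(width, height):\n    total = width * height\n    return total\n"

if __name__ == "__main__":
    print(blind(SAMPLE))
```

The renamed function computes exactly what it did before. What is lost is only the suggestion, carried by `area`, `width` and `height`, of what the code is for.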
Represent Structs as “In Memory” Objects
A VS Code plugin that will “destructify” structs in C/C++ repos, and display them based on their “in memory” representation and not their “nested abstraction” representation: https://github.com/NWc0de/Nofu
/* This is the functions plus an instance of the local variables. */
struct x509_lookup_st {
    int init;                    /* have we been started */
    int skip;                    /* don't use us. */
    X509_LOOKUP_METHOD *method;  /* the functions */
    void *method_data;           /* method data */
    X509_STORE *store_ctx;       /* who owns us */
};
Destructed structName
|
|   PrimitiveType1 typeName1;
    PrimitiveType2 typeName2;
    ...
    |
    |   PrimitiveType3 typeName3;
        ....
    |   nestedStructTypeName;
    PrimitiveType4 typeName4;
|