Upside Down Research's Software Stack
Or, What systems do you use?
- Use a static language.
Go, Rust, and Scala are today's kings, in my judgement. I've used all three, and they all have loosely similar development characteristics. Other static languages seem a little more niche from where I sit (they have their purposes, naturally). C#, for example, is huge, and I would of course look at it if I were starting a Windows-oriented project. Swift is the standard on iPhones. And so forth.
Upside Down Research generally uses Rust, because Rust has exceptional type-system safety without the JVM overhead that comes with Scala. We also have some Go services for delivering simple web applications.
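As a small illustration of the kind of type-system safety meant here (the domain and names are hypothetical): the Rust compiler forces every variant of an enum to be handled, so adding a new state later breaks the build instead of failing silently at runtime.

```rust
// Hypothetical domain type; the point is the exhaustive match below.
#[derive(Debug)]
enum DeployState {
    Pending,
    Running { replicas: u32 },
    Failed(String),
}

fn describe(state: &DeployState) -> String {
    // Omitting any arm here is a compile error, not a runtime surprise.
    match state {
        DeployState::Pending => "waiting to start".to_string(),
        DeployState::Running { replicas } => format!("running with {replicas} replicas"),
        DeployState::Failed(reason) => format!("failed: {reason}"),
    }
}

fn main() {
    let state = DeployState::Running { replicas: 3 };
    println!("{}", describe(&state));
}
```

If a `Cancelled` variant is added next quarter, every `match` over `DeployState` in the codebase stops compiling until it is handled - which is exactly the behavior you want in a growing system.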
- Two Git repositories.
A monorepo for code, a monorepo for infrastructure (Terraform, Pulumi, etc.). A monorepo allows unified interaction with the codebase. Any significant project spanning multiple services can be performed in at most two business-process loops (infrastructure + code), rather than the "n" loops required for many-repo projects. Naturally, if your many repos are split up just right, you won't run into that issue, but in practice that won't happen, and you'll be stuck trying to arrange all the merges. Not good.
Monorepos do require careful attention to continuous integration and deployment. The solution of "build the whole repository to avoid breakages" involves, unfortunately, "building the whole repository". I've seen very clever Bazel systems that only build the minimum required.
My advice: eat the monorepo tooling pain and enjoy the overall software delivery acceleration.
So what am I doing now? Upside Down Research has a repo for infrastructure, a repo for product code, a repo for website code, and a specialized repo for more-secret infrastructure.
- GCP > AWS for most things. I am strictly evaluating this on developer experience and SSO facilities when the user is on Google Workspace.
The stereotype is that Google has worse customer service than Amazon. I have not found that to be so, but I would also not be terribly bothered by hosting on AWS.
- Use Kubernetes.
The custom solution you made and like is just a subset of it. I have maintained and built several custom non-k8s solutions in the cloud, and I wholeheartedly endorse using Kubernetes rather than putting myself through that pain again. I know there are success stories with Kubernetes on-prem as well, using tools like k3s.
My preferred Kubernetes platforms are GKE and EKS. I've tried GKE Autopilot for very small workloads, but the pricing didn't work out nearly as well as I initially estimated it would.
I'm very interested to see how Kubernetes is experienced by companies building out monster on-prem GPU clusters.
- Build system should be fast. Optimize for dev time.
This is in tension with using a monorepo. I'm looking periodically at implementing Buck2 (a Bazel-alike) for UDR's product code repository, and expect I will move over fully at some point.
- Ephemeral deploy and test.
Merge request / pull request builds should deploy from the branch to a target environment. Run the integration tests against a temporary endpoint. This guarantees that the branch can actually be built! If you can't build, deploy, and test your branch, the software will be broken when it gets merged.
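As a concrete sketch, this is roughly what an ephemeral review deployment looks like in GitLab CI; the job names, the scripts under ./ci/, and the URL scheme are all hypothetical placeholders:

```yaml
# Hypothetical .gitlab-ci.yml sketch: deploy the branch to a temporary
# "review" environment, run integration tests against it, tear it down.
stages: [deploy, test]

deploy-review:
  stage: deploy
  script:
    - ./ci/deploy.sh "review-$CI_COMMIT_REF_SLUG"
  environment:
    name: review/$CI_COMMIT_REF_SLUG
    url: https://$CI_COMMIT_REF_SLUG.review.example.com
    on_stop: stop-review
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"

integration-test:
  stage: test
  needs: [deploy-review]
  script:
    - ./ci/integration-tests.sh "https://$CI_COMMIT_REF_SLUG.review.example.com"
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"

stop-review:
  stage: deploy
  script:
    - ./ci/teardown.sh "review-$CI_COMMIT_REF_SLUG"
  environment:
    name: review/$CI_COMMIT_REF_SLUG
    action: stop
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
      when: manual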
- Test on Prod.
Automated integration tests on developer mainline and prod. Always. Without this, you will not know if a defect is going to be delivered to your users.
- Use Postgres.
Yes, Memgraph, Mongo, Elastic, DynamoDB, and BigQuery all exist. But, in general, your code does not need those facilities. Postgres has become the de facto standard database in the Linux world - commercial plugins are built for it; new databases adopt the Postgres wire protocol. SQLite is feasible if - and only if - you can assert your application conforms to SQLite's expectations (roughly: a single writer against a local disk). I tend to use the "RDS"-like managed SQL facilities cloud providers offer - properly managing backups and replicas is not trivial, and it doesn't contribute to software being delivered on time. Naturally, this advice is not applicable past certain points. But if you have started with Postgres, you will be stable in your technology for a while, and you can scale very far.
- Avoid ORMs.
Their defects have been remarked on for some twenty years. They are not faster than understanding SQL. Query builders are an excellent choice as a means of generating correct SQL without locking yourself into the choices of the ORM designers.
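To show the distinction, here is a toy sketch of the query-builder idea: compose SQL from typed parts rather than hand-concatenating strings or adopting an ORM's object model. Real crates (sea-query, for instance) do this far more completely; everything here is a hypothetical illustration.

```rust
// Toy query builder: each method returns Self, so queries compose,
// and values stay behind numbered placeholders for parameterization.
struct Select {
    columns: Vec<String>,
    table: String,
    conditions: Vec<String>,
}

impl Select {
    fn from(table: &str) -> Self {
        Select { columns: vec![], table: table.to_string(), conditions: vec![] }
    }
    fn column(mut self, c: &str) -> Self {
        self.columns.push(c.to_string());
        self
    }
    fn filter(mut self, cond: &str) -> Self {
        self.conditions.push(cond.to_string());
        self
    }
    fn to_sql(&self) -> String {
        let mut sql = format!("SELECT {} FROM {}", self.columns.join(", "), self.table);
        if !self.conditions.is_empty() {
            sql.push_str(" WHERE ");
            sql.push_str(&self.conditions.join(" AND "));
        }
        sql
    }
}

fn main() {
    let q = Select::from("users")
        .column("id")
        .column("email")
        .filter("created_at > $1");
    println!("{}", q.to_sql());
}
```

The generated statement is plain SQL you can read, explain, and tune - the builder only removes the string-assembly drudgery, not your understanding of the query.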
- REST > gRPC for 1.0.
gRPC/protobuf is very attractive as a standard. It promises a shared means of interop, has good performance, and provides an excellent interface to your system. Unfortunately, there are two essential problems with gRPC/protobuf.
- curl does not work. Basic HTTP interaction is a well-supported standard for debugging and exploration; many tools exist. Lacking this, tools like grpcurl have been created, but they require, for example, access to the protobufs.
- Having a schema "freezes" your system. In pre-1.0 APIs, your schema is moving. Schemas are not bad - they are good. But they are a tradeoff between speed of development and stability. So: schemas are for when you're at 1.0.
- Monolith that system. But design it well.
Microservices have been the "fancy" cloud design standard du jour for, well, the last decade, maybe the last decade and a half. There are two fundamental reasons to use microservices as a general architecture: to remove team communication requirements, and to address sharp distinctions in domain requirements.
The first reason is, in my experience, the conventional reason to use microservices, and, in my judgement, the wrong one. Short-circuiting team communication and appropriate technical leadership will cause problems.
The second is more compelling. At a high level, for example, a web app and a database are serving approximately two separate domain requirements.
This can also be used to address differing security needs. Billing information is not particularly related to most application needs, for example. Or perhaps one service needs to exist in a separate network domain from another (example: VPC peering with a customer).
Specific engineering elements around scaling can also indicate splitting out the code that needs to scale up.
But microservices induce significant costs. Among them: debugging, network communication overhead, serialization/deserialization, deployment ordering, and code duplication. This can be a price you want to pay.
Monoliths are, fundamentally, simpler. I assess any question around microservices and service boundary design as follows:
- Does measuring the system clearly indicate we should do this?
- Does this lead to a simpler system, even given the costs indicated above?
- Is this proposal cheaper, even given the costs indicated above?
- Use the Grafana stack: Grafana, Mimir, Loki, Tempo.
I admit it - I'm a Grafana fanboy. Their software works really well. The backend systems are essentially a data service layered over blobstores like S3 or GCP Storage Buckets. It is very cheap to set up a Grafana stack and integrate it with SSO.
Note that Tempo and Mimir are expensive(ish) to run on the compute side - it makes more sense to use Jaeger and Prometheus for your traces and metrics until you start to push the limits of those systems.
Grafana also has a SaaS offering, which has a small free trial. I recommend it for starting out.
Right now for the Housecarl AuthZ product development side, I'm experimenting with using the GCP observability stack. It's not quite as integrated as the Grafana system, but it requires less setup.
And that brings me to the two big cons of Grafana - their Helm charts for Kubernetes are a bit of a mess, and their documentation is not good. Once you understand the system, it works out well enough, but the ramp can be very steep.
- Don't worry about code style. Worry about naming and system design. Clarity of thought beats standardization.
Like many software engineers, I spent a lot of time when less experienced reading coding style guides.
This did not help me be more productive. I'm not sure it helped me write better code. It did help me get into arguments with other people.
The style guide I eventually settled on years ago was, "write readable code. If the team doesn't like it, they will say so in review". This worked exceptionally well in the small Clojure/Scala team I was part of at the time.
The questions you need to attend to in your software delivery are: is the system working - does the software map to the domain - does it convey clearly what it does to the reader? Cute names such as "anotherDumbVariable" are - well - not helpful.
- Use tracing & custom metrics as soon as possible.
Tracing is essentially a tree-based "log" system that allows key-value pairs to be attached to each "node", or "span" in tracing lingo. The way I use tracing is to add a span per critical function, annotating the span with relevant information. This lets me look up exactly what occurred. The developer relations team at Honeycomb.io has spent a lot of time arguing that tracing is essentially the superset of all observability. I am, loosely, convinced, particularly for the app case.
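Structurally, a trace is just that tree of annotated spans. A toy model makes the shape obvious (in practice you would use the tracing crate or OpenTelemetry; all names here are hypothetical):

```rust
use std::collections::BTreeMap;

// Toy model of a trace: a tree of named spans, each carrying
// key-value annotations. Real tracing libraries add timing,
// IDs, and export - the shape is the same.
#[derive(Debug)]
struct Span {
    name: String,
    attrs: BTreeMap<String, String>,
    children: Vec<Span>,
}

impl Span {
    fn new(name: &str) -> Self {
        Span { name: name.to_string(), attrs: BTreeMap::new(), children: vec![] }
    }
    fn annotate(&mut self, key: &str, value: &str) {
        self.attrs.insert(key.to_string(), value.to_string());
    }
    // Render the tree, one span per line, indented by depth.
    fn render(&self, depth: usize) -> String {
        let mut out = format!("{}{} {:?}\n", "  ".repeat(depth), self.name, self.attrs);
        for child in &self.children {
            out.push_str(&child.render(depth + 1));
        }
        out
    }
}

fn main() {
    // One span per critical function, annotated with what it saw.
    let mut root = Span::new("handle_request");
    root.annotate("user_id", "42");
    let mut db = Span::new("load_profile");
    db.annotate("rows", "1");
    root.children.push(db);
    print!("{}", root.render(0));
}
```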
Custom metrics are the other key approach I use to understand behavior. These are basic time series, with simple tags attached to a given series. This is the classic Prometheus/Datadog style of metrics.
This should be done as soon as possible - retrofitting tools for understanding is fiddly and frustrating, and will inevitably leave huge gaps.
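The "tagged time series" idea behind custom metrics is also simple enough to sketch. This hypothetical counter shows the core contract - the same tag set always increments the same series, regardless of tag order; real code would use a client library such as the prometheus crate:

```rust
use std::collections::BTreeMap;

// Toy tagged counter: one monotonically increasing count per
// distinct set of (tag, value) pairs.
#[derive(Default)]
struct Counter {
    series: BTreeMap<Vec<(String, String)>, u64>,
}

impl Counter {
    fn key(tags: &[(&str, &str)]) -> Vec<(String, String)> {
        let mut key: Vec<(String, String)> =
            tags.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect();
        key.sort(); // tag order must not create distinct series
        key
    }
    fn inc(&mut self, tags: &[(&str, &str)]) {
        *self.series.entry(Self::key(tags)).or_insert(0) += 1;
    }
    fn get(&self, tags: &[(&str, &str)]) -> u64 {
        *self.series.get(&Self::key(tags)).unwrap_or(&0)
    }
}

fn main() {
    let mut requests = Counter::default();
    requests.inc(&[("route", "/login"), ("status", "200")]);
    requests.inc(&[("status", "200"), ("route", "/login")]); // same series
    println!("{}", requests.get(&[("route", "/login"), ("status", "200")]));
}
```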
- GitLab >> GitHub, Bitbucket, and other forges.
GitLab provides a soup-to-nuts software delivery solution. The issues, the CI/CD, the project management all work well, without having to perform horizontal integration with other tooling. It doesn't have the cachet or the social contribution facility of GitHub, to be sure. But, critically, it gets the job done without issues.
- And finally, the really hot take. Don't worry about dashboards. Know what to query.
A dashboard simplifies the underlying signals, which are inevitably derived from whatever your systems emit (logs, metrics, traces).
Focusing your attention on dashboard creation implicitly and inherently removes information that can be critical to solving a given problem. Ergo, a dashboard will likely mislead when an interesting novel problem shows up in your production system.
Thus, the signals need to be substantial and descriptive enough to allow developers to derive appropriate information about what is actually occurring. That, in turn, entails knowing what to query, along with developing discoverable and useful signals.
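To make "knowing what to query" concrete: with the Grafana-style stack above, the ad-hoc questions look roughly like the following. The label and metric names are hypothetical; substitute whatever your services actually emit.

```
# LogQL (Loki): recent error lines from one service, filtered after JSON parsing
{app="api"} |= "error" | json | status >= 500

# PromQL (Prometheus/Mimir): p99 request latency over the last five minutes
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```

If your team can write these from memory during an incident, a missing dashboard panel costs you nothing.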
And, of course (hot-take nuance time :-) ), one day you will find that certain signals are your standard for review, and saving them in a "board" works very well. But by then the cart is behind the horse, and that is why it works.
To summarize and review the overarching principles here:
- minimize complexity
- minimize moving parts
- minimize integration
- Don't add dependencies without measurable, numerical reasons.
The author is not endorsed by, affiliated with, compensated by, or otherwise receiving kickbacks from any of the businesses mentioned here. :-)