I've spent a lot of time jumping in to optimise horrendously complicated programs and data pipelines and things. The big wins are from understanding the domain and spotting all the inevitable misunderstandings and unnecessary steps that creep in when systems are so big they get split up and built/maintained by different people at different times.
This! Extending the list of 4:
5. Remove unrequired behaviour.
6. Negotiate the required behaviour with stakeholders.
7. UX changes. E.g. turn a synchronous flow into a background job plus a notification. Return the quick parts of the operation sooner (e.g. like a progressive JPEG).
8. Architecture changes. E.g. monolithification, microservification. Lambdas vs. VM vs. Fargate etc.
And some more technical:
9. Caches? (a minimal sketch follows this list)
10. Scalability, more VMs
11. Move compute local (on client, on edge compute, nearby region).
11a. Same as 11 for data residency.
12. Data store optimisation e.g. indices, query plans, foreign keys, consistent hashing (arguably a repeat of data structures)
13. Use a data centre for more bang/buck than cloud.
14. Compute type: use a GPU instead of a CPU, etc. I'll bundle L1 cache considerations in here too.
15. Look at sources of latency: proxies, sidecars, network hops (and their locations), etc.
16. GC pauses, event loops, threading, other processes, etc.
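To make 9 (caches) concrete, here's a minimal sketch of a tiny in-process cache in front of a hypothetical expensive call - expensive_lookup and the direct-mapped layout are only illustrative, not anything from the article:

    #include <stdio.h>

    /* Hypothetical slow call we want to avoid repeating. */
    static int expensive_lookup(int key) {
        return key * key;   /* pretend this hits a slow service */
    }

    #define CACHE_SIZE 256
    struct entry { int key; int value; int valid; };
    static struct entry cache[CACHE_SIZE];   /* zero-initialised: all slots invalid */

    /* Direct-mapped cache: a collision simply overwrites the old entry. */
    static int cached_lookup(int key) {
        struct entry *e = &cache[(unsigned)key % CACHE_SIZE];
        if (e->valid && e->key == key)
            return e->value;                 /* hit: skip the expensive call */
        e->key = key;
        e->value = expensive_lookup(key);    /* miss: compute and remember */
        e->valid = 1;
        return e->value;
    }

    int main(void) {
        printf("%d\n", cached_lookup(12));   /* miss */
        printf("%d\n", cached_lookup(12));   /* hit */
        return 0;
    }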
This is missing the 5th and most important optimization technique: don't pessimize the code in the first place.
The usual culprit is "premature modularization", where code that is used in one place and is never going to be extended is nonetheless full of indirections.
In principle it can in general fit the points 1-3, if you view fewer abstractions as a lower-level system and the code itself as a data structure and algorithm, which can also include different levels of parallelism.
> In principle it can in general fit the points 1-3
In principle, I don't think people would lump it in.
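To make "premature modularization" above concrete, here's a sketch (names invented) of the kind of single-use indirection meant, next to the direct version:

    #include <stdio.h>

    /* One caller, one implementation ever, yet the call goes through an ops table. */
    struct greeter_ops { void (*greet)(const char *name); };

    static void greet_stdout(const char *name) { printf("hello, %s\n", name); }
    static struct greeter_ops ops = { greet_stdout };

    void do_greeting(const char *name) { ops.greet(name); }   /* indirect call */

    /* The un-pessimised version is just the direct call: */
    void do_greeting_direct(const char *name) { printf("hello, %s\n", name); }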
Optimize for maintainable (readable) code with sane, understandable flows. Then optimize for the user experience. Processors complete lifetimes when a human blinks. If you're doing /meaningful/ work with those cycles, that's fine. If it's bloat... trim bloat.
The blog post is really good. I see that there's a follow-up piece 'The Fifth Kind of Optimisation' about parallelism.
Something that I'd like to add is that it's helpful to understand the optimization capabilities of our compiler. Ideally, we would like to write programs that don't have what are called 'optimization blockers' - constructs that make it hard for the compiler to generate optimized executables.
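One classic blocker (a minimal sketch, not from the article) is memory aliasing: because *dest might overlap a[], the compiler can't keep the running sum in a register in the first version.

    /* Blocked: *dest may alias a[], so it is re-read and re-written every iteration. */
    void sum_blocked(const long *a, long n, long *dest) {
        *dest = 0;
        for (long i = 0; i < n; i++)
            *dest += a[i];
    }

    /* Unblocked: accumulate in a local, store once at the end. */
    void sum_unblocked(const long *a, long n, long *dest) {
        long acc = 0;
        for (long i = 0; i < n; i++)
            acc += a[i];
        *dest = acc;
    }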
I like the pointer to the blog on accidentally quadratic implementations. I find that the following pattern is often a landmine:
    for (int i = 0; i < strlen(s); i++)
        // code in loop
strlen(s) gets recomputed on every iteration; each call is O(n), so the loop as a whole is O(n^2).
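The usual fix (a minimal sketch) is to hoist the call out of the loop:

    size_t len = strlen(s);              /* computed once */
    for (size_t i = 0; i < len; i++) {
        /* code in loop */
    }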
Finally, being aware that I/O latencies are a major source of bottlenecks leads to nice optimizations. One advantage of multiple threads is that they can sometimes hide the I/O latency. In general, writing programs with good memory locality is one of the better levers for optimization.
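As a rough sketch of hiding I/O latency with a second thread (the slow read is simulated with sleep, and the numbers are arbitrary):

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Stand-in for a blocking read from disk or the network. */
    static void *fetch(void *arg) {
        sleep(1);
        *(int *)arg = 42;
        return NULL;
    }

    int main(void) {
        int fetched = 0;
        pthread_t t;
        pthread_create(&t, NULL, fetch, &fetched);   /* kick off the I/O early */

        long sum = 0;
        for (long i = 0; i < 100000000L; i++)        /* useful CPU work overlaps the wait */
            sum += i;

        pthread_join(t, NULL);                       /* the I/O is (hopefully) already done */
        printf("sum=%ld fetched=%d\n", sum, fetched);
        return 0;
    }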
Python is not something I expect to see when talking about performance. Being interpreted, with powerful builtins, it distorts what is fast and what isn’t in a fairly unpredictable way.
Previous discussion from 2023: https://news.ycombinator.com/item?id=38262251
Recent discussion on the follow-up, "The Fifth Kind of Optimisation": https://news.ycombinator.com/item?id=43555311
The first thing I thought when I saw your code was “what about optimizing for readability?”
APL programmers would say it's already maximally optimized for readability (actually, it could be improved further with fewer and shorter keywords and stdlib function names). That's not even a joke or a dig at APL programmers: whether that code would be more or less readable with longer, more descriptive names is entirely a matter of perspective, preference, and context.
> it may well be fast enough, which is what really matters.
This deserves to be a headline in most optimisation discussions. Fast enough or small enough is often all that matters; start there.
5. Use a higher level system that can offer better optimizations.
This can include a more advanced VM, or using other advanced problem-oriented systems like databases or problem solvers.
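As a small sketch of leaning on such a system - here SQLite's planner does the aggregation and any index selection instead of hand-written loops (the table and data are made up):

    #include <stdio.h>
    #include <sqlite3.h>

    static int print_row(void *unused, int ncols, char **vals, char **names) {
        (void)unused; (void)names;
        for (int i = 0; i < ncols; i++)
            printf("%s%s", vals[i] ? vals[i] : "NULL", i + 1 < ncols ? " | " : "\n");
        return 0;
    }

    int main(void) {
        sqlite3 *db;
        if (sqlite3_open(":memory:", &db) != SQLITE_OK) return 1;

        sqlite3_exec(db,
            "CREATE TABLE orders(id INTEGER PRIMARY KEY, customer TEXT, total REAL);"
            "CREATE INDEX idx_customer ON orders(customer);"
            "INSERT INTO orders VALUES (1,'alice',10.0),(2,'bob',20.0),(3,'alice',5.0);",
            NULL, NULL, NULL);

        /* The engine picks the access path; we only state what we want. */
        sqlite3_exec(db, "SELECT customer, SUM(total) FROM orders GROUP BY customer;",
                     print_row, NULL, NULL);

        sqlite3_close(db);
        return 0;
    }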
5. Find a better data source