I would really love to have a better understanding of what is going on here!
If you format a 32 MB disk using NTFS, you’ll get the following result:
So about 10 MB are taken for NTFS metadata. I guess that makes sense, and giving up 10 MB isn’t generally a big deal these days, so I wouldn’t worry about it.
I write a 20 MB file and punch a hole in it between 6 MB and 18 MB (12 MB in total), so we have:
And in terms of disk space, we have:
The numbers match, awesome! Let’s create a new 12 MB file, like so:
And the disk is:
And now I’m running the following code, which maps the first file (with the hole punched in it) and writes 4 MB to it using memory-mapped I/O:
HANDLE hMapFile = CreateFileMapping(hFile, NULL, PAGE_READWRITE, 0, 0, NULL);
if (hMapFile == NULL) {
    fprintf(stderr, "Could not create file mapping object: %x\n", GetLastError());
    exit(__LINE__);
}
char* lpMapAddress = MapViewOfFile(hMapFile, FILE_MAP_WRITE, 0, 0, 0);
if (lpMapAddress == NULL) {
    fprintf(stderr, "Could not map view of file: %x\n", GetLastError());
    exit(__LINE__);
}
for (i = 6 * MB; i < 10 * MB; i++) {
    ((char*)lpMapAddress)[i]++;
}
if (!FlushViewOfFile(lpMapAddress, 0)) {
    fprintf(stderr, "Could not flush view of file: %x\n", GetLastError());
    exit(__LINE__);
}
if (!FlushFileBuffers(hFile)) {
    fprintf(stderr, "Could not flush file buffers: %x\n", GetLastError());
    exit(__LINE__);
}
The end result for this file is:
So with the other file, we have a total of 24 MB in use on a 32 MB disk. And here is the state of the disk itself:
The problem is that 9.78 MB were already in use right after formatting the disk, and now we are somehow using at least some of that space to store file data.
I’m getting the same behavior when I use normal file I/O:
moveAmount.QuadPart = 6 * MB;
SetFilePointerEx(hFile, moveAmount, NULL, FILE_BEGIN);
for (i = 6; i < 10; i++) {
    if (!WriteFile(hFile, buffer, MB, &bytesWritten, NULL)) {
        fprintf(stderr, "WriteFile failed on iteration %d: %x\n", i, GetLastError());
        exit(__LINE__);
    }
}
So somehow in this sequence of operations, we get more disk space. On the other hand, if I try to write just 22 MB into a single file, it fails. See:
hFile = CreateFileA("R:/original_file.bin", GENERIC_READ | GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
if (hFile == INVALID_HANDLE_VALUE) {
    printf("Error creating file: %d\n", GetLastError());
    exit(__LINE__);
}
for (int i = 0; i < 22; i++) {
    if (!WriteFile(hFile, buffer, MB, &bytesWritten, NULL)) {
        fprintf(stderr, "WriteFile failed on iteration %d: %x\n", i, GetLastError());
        exit(__LINE__);
    }
}
You can find the full source here. I would love to understand what exactly is happening and how we suddenly end up with more usable disk space in this scenario.
Today I set out to figure out an answer to a very specific question. What happens at the OS level when you try to allocate disk space for a sparse file and there is no additional disk space?
Sparse files are a fairly advanced feature of file systems. They allow you to define a file whose size is 10GB, but that only takes 2GB of actual disk space. The rest is sparse (takes no disk space and on read will return just zeroes). The OS will automatically allocate additional disk space for you if you write to the sparse ranges.
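For a concrete sense of what that means, a minimal sketch (not from the original post) is to extend a file without writing to it and then compare its logical size with its allocated blocks:

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("sparse.bin", O_CREAT | O_RDWR, 0644);
    // Logical size: 10 GB. No blocks are allocated until we write.
    if (ftruncate(fd, 10L * 1024 * 1024 * 1024) != 0)
        return 1;

    struct stat st;
    fstat(fd, &st);
    // st_size reports 10 GB, st_blocks * 512 stays near zero for the sparse ranges.
    printf("size: %lld, allocated: %lld\n",
           (long long)st.st_size, (long long)st.st_blocks * 512);
    close(fd);
    return 0;
}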
This leads to an interesting question, what happens when you write to a sparse file if there is no additional disk space?
Let’s look at the problem on Linux first. We define a RAM disk with 32MB, like so:
As expected, this code will fail on the 5th write (since there is no more disk space left to allocate). The error would be:
Write error: errno = 28 (No space left on device)
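A minimal sketch of such a test (the file names follow the output below, the exact offsets are assumed, and the RAM disk is a 32 MB tmpfs mount) would be:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MB (1024 * 1024)

int main(void) {
    char buffer[MB] = {0};

    // Fill the 32 MB RAM disk completely.
    int full = open("/mnt/ramdisk/fullfile", O_CREAT | O_RDWR, 0644);
    for (int i = 0; i < 32; i++)
        write(full, buffer, MB);

    // Punch an 8 MB hole, freeing space on the disk (offsets assumed).
    fallocate(full, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 8 * MB, 8 * MB);

    // Use 4 MB of the freed space for another file.
    int other = open("/mnt/ramdisk/anotherfile", O_CREAT | O_RDWR, 0644);
    for (int i = 0; i < 4; i++)
        write(other, buffer, MB);

    // Now write back into the hole. Only 4 MB are actually free,
    // so the 5th write fails with ENOSPC.
    lseek(full, 8 * MB, SEEK_SET);
    for (int i = 0; i < 8; i++) {
        if (write(full, buffer, MB) != MB) {
            printf("Write error: errno = %d (%s)\n", errno, strerror(errno));
            return 1;
        }
    }
    return 0;
}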
Here is what the file system reports:
$ du -h /mnt/ramdisk/*
4.0M    /mnt/ramdisk/anotherfile
28M     /mnt/ramdisk/fullfile

$ ll -h /mnt/ramdisk/
total 33M
drwxrwxrwt 2 root   root     80 Jan  9 10:43 ./
drwxr-xr-x 6 root   root   4.0K Jan  9 10:30 ../
-rw-r--r-- 1 ayende ayende 4.0M Jan  9 10:43 anotherfile
-rw-r--r-- 1 ayende ayende  32M Jan  9 10:43 fullfile
As you can see, du reports a total of 32 MB actually in use, but ll shows file sizes that add up to more than that (because of the hole punching).
What would happen if we were to run this using memory-mapped I/O? Here is the code:
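A sketch of that idea (again with assumed offsets, writing into the punched hole through the mapping) looks like this:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MB (1024 * 1024)

int main(void) {
    int fd = open("/mnt/ramdisk/fullfile", O_RDWR);
    char *map = mmap(NULL, 32 * MB, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    // Touch pages inside the punched hole. Each page fault has to
    // allocate backing space, and there isn't enough of it.
    for (size_t i = 8 * MB; i < 16 * MB; i++)
        map[i]++;

    msync(map, 32 * MB, MS_SYNC);
    return 0;
}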
This leads to an interesting scenario: writing through the mapping requires allocating disk space for those pages (note that we are writing into the hole), and this code will fail with a segmentation fault.
It will fail inside the loop, by the way: as part of the page fault that brings the memory in, the file system needs to allocate the disk space. If there is no such disk space, that allocation fails. The only way for the OS to behave in this case is to fail the write, which leads to a segmentation fault.
I also tried that on Windows. I defined a virtual disk like so:
This creates a 32MB disk and assigns it the letter R. Note that we are using NTFS, which has its own metadata overhead, so we have roughly 21MB or so of usable disk space to play with here.
Here is the Windows code that simulates the same behavior as the Linux code above:
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
#include <winioctl.h>

#define MB (1024 * 1024)

int main() {
    HANDLE hFile, hFile2;
    DWORD bytesWritten;
    LARGE_INTEGER fileSize, moveAmount;
    char* buffer = malloc(MB);
    int i;

    DeleteFileA("R:\\original_file.bin");
    DeleteFileA("R:\\another_file.bin");

    hFile = CreateFileA("R:/original_file.bin", GENERIC_READ | GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE) {
        printf("Error creating file: %d\n", GetLastError());
        exit(__LINE__);
    }
    for (int i = 0; i < 20; i++) {
        if (!WriteFile(hFile, buffer, MB, &bytesWritten, NULL)) {
            fprintf(stderr, "WriteFile failed on iteration %d: %x\n", i, GetLastError());
            exit(__LINE__);
        }
        if (bytesWritten != MB) {
            fprintf(stderr, "Failed to write full buffer on iteration %d\n", i);
            exit(__LINE__);
        }
    }

    FILE_ZERO_DATA_INFORMATION zeroDataInfo;
    zeroDataInfo.FileOffset.QuadPart = 6 * MB;
    zeroDataInfo.BeyondFinalZero.QuadPart = 18 * MB;
    if (!DeviceIoControl(hFile, FSCTL_SET_SPARSE, NULL, 0, NULL, 0, NULL, NULL) ||
        !DeviceIoControl(hFile, FSCTL_SET_ZERO_DATA, &zeroDataInfo, sizeof(zeroDataInfo), NULL, 0, NULL, NULL)) {
        printf("Error setting zero data: %d\n", GetLastError());
        exit(__LINE__);
    }

    // Create another file of size 4 MB
    hFile2 = CreateFileA("R:/another_file.bin", GENERIC_READ | GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile2 == INVALID_HANDLE_VALUE) {
        printf("Error creating second file: %d\n", GetLastError());
        exit(__LINE__);
    }
    for (int i = 0; i < 4; i++) {
        if (!WriteFile(hFile2, buffer, MB, &bytesWritten, NULL)) {
            fprintf(stderr, "WriteFile 2 failed on iteration %d: %x\n", i, GetLastError());
            exit(__LINE__);
        }
        if (bytesWritten != MB) {
            fprintf(stderr, "Failed to write full buffer 2 on iteration %d\n", i);
            exit(__LINE__);
        }
    }

    moveAmount.QuadPart = 12 * MB;
    SetFilePointerEx(hFile, moveAmount, NULL, FILE_BEGIN);
    for (i = 0; i < 8; i++) {
        if (!WriteFile(hFile, buffer, MB, &bytesWritten, NULL)) {
            printf("Error writing to file: %d\n", GetLastError());
            exit(__LINE__);
        }
    }
    return 0;
}
And that gives us the exact same behavior as in Linux. One of these writes will fail because there is no more disk space for it. What about when we use memory-mapped I/O?
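A sketch of the memory-mapped variant on Windows (continuing from the program above and writing into the punched hole) looks like the following; in my understanding, the failed page-in surfaces as a structured exception (EXCEPTION_IN_PAGE_ERROR) at the faulting access rather than as a failed API call:

HANDLE hMap = CreateFileMapping(hFile, NULL, PAGE_READWRITE, 0, 0, NULL);
if (hMap == NULL) {
    fprintf(stderr, "Could not create file mapping object: %x\n", GetLastError());
    exit(__LINE__);
}
char *view = MapViewOfFile(hMap, FILE_MAP_WRITE, 0, 0, 0);
if (view == NULL) {
    fprintf(stderr, "Could not map view of file: %x\n", GetLastError());
    exit(__LINE__);
}
// Writing into the sparse range forces NTFS to allocate clusters during
// the page fault; with no free space, the access itself fails.
for (int i = 12 * MB; i < 20 * MB; i++)
    view[i]++;
FlushViewOfFile(view, 0);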
I didn’t bother checking Mac or BSD, but I’m assuming that they behave in the same manner. I can’t conceive of anything else that they could reasonably do.
In my previous post, I discussed how Linux will silently truncate a big write (> 2 GB) for you. That is expected by the interface of write(). The problem is that this behavior also applies when you use IO_Uring.
If BUFFER_SIZE is 3 GB, then this will write about 2 GB to the file. The number of bytes written is correctly reported, but the complexity this generates is huge. Consider the following function:
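The function itself is not shown here; its naive shape (a sketch that reuses the handle and page_to_write structures from the full listing further down) is: queue one write per buffer, submit everything, and check that each completion succeeded:

// Sketch only: assumes the same structures and error codes used in the final version below.
int32_t _submit_writes_naive(struct handle *handle, int32_t count,
                             struct page_to_write *buffers,
                             int32_t *detailed_error_code) {
    struct io_uring *ring = &handle->global_state->ring;

    for (int32_t i = 0; i < count; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring); // ignores a full ring
        io_uring_prep_write(sqe, handle->file_fd, buffers[i].ptr,
                            buffers[i].count_of_pages * VORON_PAGE_SIZE,
                            buffers[i].page_num * VORON_PAGE_SIZE);
    }

    int32_t rc = io_uring_submit_and_wait(ring, count);
    if (rc < 0) {
        *detailed_error_code = -rc;
        return FAIL_IO_RING_SUBMIT;
    }

    struct io_uring_cqe *cqe;
    uint32_t head = 0, seen = 0;
    int32_t result = SUCCESS;
    io_uring_for_each_cqe(ring, head, cqe) {
        seen++;
        if (cqe->res < 0) { // no handling of partial writes here - that is the problem
            *detailed_error_code = -cqe->res;
            result = FAIL_IO_RING_WRITE_RESULT;
        }
    }
    io_uring_cq_advance(ring, seen);
    return result;
}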
In other words, send all the data to the IO Ring, then wait for all those operations to complete. We verify complete success and can then move on. However, because we may have a write that is greater than 2 GB, and because the interface allows the IO Uring to write less than we thought it would, we need to handle that with retries.
After thinking about this for a while, I came up with the following implementation:
int32_t _submit_writes_to_ring(
    struct handle *handle,
    int32_t count,
    struct page_to_write *buffers,
    int32_t *detailed_error_code) {

    struct io_uring *ring = &handle->global_state->ring;
    off_t *offsets = handle->global_state->offsets;
    memset(offsets, 0, count * sizeof(off_t));

    while (true) {
        int32_t submitted = 0;
        for (size_t i = 0; i < count; i++) {
            off_t offset = offsets[i];
            if (offset == buffers[i].count_of_pages * VORON_PAGE_SIZE)
                continue;
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
            if (sqe == NULL) // the ring is full, flush it...
                break;
            io_uring_sqe_set_data64(sqe, i);
            io_uring_prep_write(sqe, handle->file_fd,
                buffers[i].ptr + offset,
                buffers[i].count_of_pages * VORON_PAGE_SIZE - offset,
                buffers[i].page_num * VORON_PAGE_SIZE + offset);
            submitted++;
        }

        if (submitted == 0)
            return SUCCESS;

        int32_t rc = io_uring_submit_and_wait(ring, submitted);
        if (rc < 0) {
            *detailed_error_code = -rc;
            return FAIL_IO_RING_SUBMIT;
        }

        struct io_uring_cqe *cqe;
        uint32_t head = 0;
        uint32_t i = 0;
        bool has_errors = false;
        io_uring_for_each_cqe(ring, head, cqe) {
            i++;
            uint64_t index = io_uring_cqe_get_data64(cqe);
            int result = cqe->res;
            if (result < 0) {
                has_errors = true;
                *detailed_error_code = -result;
            } else {
                offsets[index] += result;
                if (result == 0) {
                    // there shouldn't be a scenario where we return 0
                    // for a write operation; we may want to retry here,
                    // but figuring out if this is a one-off or if
                    // we need to retry this operation (_have_ we retried it?) is
                    // complex enough to treat this as an error for now.
                    has_errors = true;
                    *detailed_error_code = EIO;
                }
            }
        }
        io_uring_cq_advance(ring, i);

        if (has_errors)
            return FAIL_IO_RING_WRITE_RESULT;
    }
}
That is a lot of code, but that is mostly because of how C works. What we do here is scan through the buffers we need to write, alongside an array of offsets that tracks how much of each buffer has already been written.
If the offset to write doesn’t indicate that we’ve written the whole thing, we’ll submit it to the ring and keep going until we either fill the entire ring or run out of buffers to work with. The next step is to submit the work and wait for it to complete, then run through the results, check for errors, and update the offset that we wrote for the relevant buffer.
Then, we scan the buffers array again to find either partial writes that we have to complete (we didn’t write the whole buffer) or buffers that we didn’t write at all because we filled the ring. In either case, we submit the new batch of work to the ring and repeat until we run out of work.
This code assumes that we cannot have a non-error state where we write 0 bytes to the file and treats that as an error. We also assume that an error in writing to the disk is fatal, and the higher-level code will discard the entire IO_Uring if that happens.
The Windows version, by the way, is somewhat simpler. Windows explicitly limits the size of the buffer you can pass to the write() call (and its IO Ring equivalent). It also ensures that it will write the whole thing, so partial writes are not an issue there.
It is interesting to note that the code above will effectively stripe writes if you send very large buffers. Let’s assume that we send it two large buffers, like so:
           Offset    Size
Buffer 1   1 GB      4 GB
Buffer 2   10 GB     6 GB
The patterns of writes that will actually be executed are:
1GB .. 3 GB, 10 GB .. 12 GB
3 GB .. 5 GB, 12 GB .. 14 GB
14 GB .. 16 GB
I can “fix” that by never issuing writes that are larger than 2 GB and issuing separate writes for each 2 GB range, but that leads to other complexities (e.g., tracking state if I split a write and hit the full ring status, etc.). At those sizes, it doesn’t actually matter in terms of efficiency or performance.
Partial writes are almost always a sign of either very large writes that were broken up or some underlying issue that is already problematic, so I don’t care that much about that scenario in general. For the vast majority of cases, this will always issue exactly one write for each buffer.
What is really interesting from my point of view, however, is how even a pretty self-contained feature can get pretty complex internally. On the other hand, this behavior allows me to push a whole bunch of work directly to the OS and have it send all of that to the disk as fast as possible.
In our scenarios, under load, we may call that with thousands to tens of thousands of pages (each 8 KB in size) spread all over the file. The buffers are actually sorted, so ideally, the kernel will be able to take advantage of that, but even if not, just reducing the number of syscalls will result in performance savings.
I previously asked what the code below does, and mentioned that it should give interesting insight into the kind of mindset and knowledge a candidate has. Take a look at the code again:
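The snippet boils down to allocating a buffer larger than 2 GB with malloc() and handing the whole thing to a single write() call; a minimal sketch (BUFFER_SIZE and the file name are placeholders) is:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BUFFER_SIZE (3LL * 1024 * 1024 * 1024) // anything > 2 GB

int main(void) {
    char *buffer = malloc(BUFFER_SIZE); // assume this succeeds on a 64-bit machine
    int fd = open("large_file.bin", O_CREAT | O_WRONLY, 0644);
    ssize_t written = write(fd, buffer, BUFFER_SIZE);
    printf("wrote %zd bytes\n", written); // prints 2147479552 (0x7ffff000), not 3 GB
    return 0;
}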
~$ head large_file.bin | hexdump -C
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
7ffff000
The question is why? And the answer is quite simple. Linux has a limitation of about 2 GB for writes to the disk. Any write call that attempts to write more than that will only write that much, and you’ll have to call the system again. This is not an error, mind. The write call is free to write less than the size of the buffer you passed to it.
Windows has the same limit, but it is honest about it
In Windows, all write calls accept a 32-bit int as the size of the buffer, so this limitation is clearly communicated in the API. Windows will also ensure that for files, a WriteFile call that completes successfully writes the entire buffer to the disk.
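You can see that directly in the declaration of WriteFile: the number of bytes to write is a 32-bit DWORD, so the limit is baked into the API itself:

BOOL WriteFile(
    HANDLE       hFile,
    LPCVOID      lpBuffer,
    DWORD        nNumberOfBytesToWrite, // 32-bit size: the limit is right in the API
    LPDWORD      lpNumberOfBytesWritten,
    LPOVERLAPPED lpOverlapped
);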
And why am I writing 2 GB of zeros? In the code above, I’m using malloc(), not calloc(), so I wouldn’t expect the values to be zero. Because this is a large allocation, malloc() calls the OS to provide us with the buffer directly, and the OS is contractually obligated to provide us with zeroed pages.
This shows both surprising behavior and serves as a good opening for discussion on a whole bunch of issues. In an interview setting, that can give us a lot of insight into the sort of knowledge a candidate has.
The scenario in question was performance degradation over time. The metric in question was the average request latency, and we could track a small but consistent rise in this number over the course of days and weeks. The load on the server remained pretty much constant, but the latency of the requests grew.
The problem was that this took time - many days or multiple weeks - for us to observe. But we had the charts to prove that this was pretty consistent. If the RavenDB service was restarted (we did not have to restart the machine), the situation would instantly fix itself and then slowly degrade over time.
The fact that the customer didn’t notice this is an interesting story on its own. RavenDB will automatically prioritize the fastest node in the cluster to be the “customer-facing” one, and that alleviated the issue to such an extent that the metrics the customer usually monitors were fine. The RavenDB Cloud team looks at the entire system, so we started the investigation long before the problem warranted the users’ attention.
I hate these sorts of issues because they are really hard to figure out and subject to basically every caveat under the sun. In this case, we basically had exactly nothing to go on. The workload was pretty consistent, and I/O, memory, and CPU usage were acceptable. There was no starting point to look at.
Those are also big machines, with hundreds of GB of RAM and running heavy workloads. These machines have great disks and a lot of CPU power to spare. What is going on here?
After a long while, we got a good handle on what was actually going on. When RavenDB starts, it creates memory maps of the files it is working with. Over time, RavenDB will map, unmap, and remap as needed. A process that has been running for a long while, with many databases and indexes operating, will have done a lot of memory-mapping work.
In Linux, you can inspect those details by reading /proc/[pid]/smaps:
Here you can see the first page of entries from this file. Just starting up RavenDB (with no databases created) will generate close to 2,000 entries. The smaps virtual file can be really invaluable for figuring out certain types of problems. In the snippet above, you can see that we have some executable memory ranges mapped, for example.
The problem is that over time, memory becomes fragmented, and we may end up with an smaps file that contains tens of thousands (or even hundreds of thousands) of entries.
Here is the result of running perf top on the system; you can see that the top three items hogging most of the resources are related to smaps accounting.
This file provides such useful information that we monitor it on a regular basis. It turns out that this can have… interesting effects. Consider that while we are scanning through all the memory mappings, we may need to change the memory mappings of the process. That leads to contention on the kernel locks that protect them, of course.
It’s expensive to generate the smaps file
Reading from /proc/[pid]/smaps is not a simple file read. It involves the kernel gathering detailed memory statistics (e.g., memory regions, page size, resident/anonymous/shared memory usage) for each virtual memory area (VMA) of the process. For large processes with many memory mappings, this can be computationally expensive as the kernel has to gather the required information every time /proc/[pid]/smaps is accessed.
When /proc/[pid]/smaps is read, the kernel needs to access memory-related structures. This may involve taking locks on certain parts of the process’s memory management system. If this is done too often or across many large processes, it could lead to contention or slow down the process itself, especially if other processes are accessing or modifying memory at the same time.
If the number of memory mappings is high and the interval at which we monitor is short… I hope you can see where this is going. We effectively spent so much time running over this file that we blocked other operations.
This wasn’t an issue when we just started the process, because the number of memory mappings was small, but as we worked on the system and the number of memory mappings grew… we eventually started hitting contention.
The solution was two-fold. First, we made sure that there is only ever a single thread reading the information from smaps (previously it might have been triggered from multiple locations), and we added some throttling to ensure that we aren’t hammering the kernel with requests for this file too often (returning cached information if needed). Second, we switched from using smaps to using smaps_rollup instead. The rollup version provides much better performance, since it deals with summary data only.
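For illustration only (this is not RavenDB’s monitoring code), reading the summary is a plain read of /proc/[pid]/smaps_rollup, which the kernel produces by aggregating over all VMAs in a single pass:

#include <stdio.h>

// Minimal sketch: dump the rollup for the current process.
// A real monitor would parse lines such as "Rss:" and "Swap:" and cache the result.
int main(void) {
    FILE *f = fopen("/proc/self/smaps_rollup", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }
    char line[256];
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);
    fclose(f);
    return 0;
}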
With those changes in place, we deployed to production and waited. The result was flat latency numbers and another item that the Cloud team could strike off the board successfully.
The task is to build a Work Breakdown Structure, where you have:
Projects
Major deliverables
Sub-deliverables
Work packages
The idea is to be able to track EstimatedHours and CompletedHours across the entire tree. For example, let’s say that I have the following:
Project: Bee Keeper Chronicle App
Major deliverable: App Design
Sub-deliverable: Wireframes all screens
Work Package: Login page wireframe
Users can add the EstimatedHours and CompletedHours at any level, and we want to be able to aggregate the data upward. So the Project: “Bee Keeper Chronicle App” should have a total estimated time and the number of hours that were worked on.
The question is how to model & track that in RavenDB. Here is what I think the document structure should look like:
I used a Parent relationship, since that is the most flexible (it allows you to move each item to a completely different part of the tree easily). Now, we need to aggregate the values for all of those items using a Map-Reduce index.
Because of the similar structure, I created the following JS function:
function processWorkBreakdownHours(doc) {
    let hours = { EstimatedHours: doc.EstimatedHours, CompletedHours: doc.CompletedHours };
    let results = [Object.assign({ Scope: id(doc) }, hours)];
    let current = doc;
    while (current.Parent) {
        current = load(current.Parent.Id, current.Parent.Type);
        results.push(Object.assign({ Scope: id(current) }, hours));
    }
    return results;
}
This will scan over the hierarchy and add the number of estimated and completed hours to all the levels. Now we just need to add this file as Additional Sources to an index and call it for all the relevant collections, like this:
The end result is automatic aggregation at all levels. Change one item, and it will propagate upward.
Considerations:
I’m using load() here, which creates a reference from the parent to the child. The idea is that if we move a Work Package from one Sub-deliverable to another (in the same or a different Major & Project), this index will automatically re-index what is required and get you the right result.
However, that also means that whenever the Major document changes, we’ll have to re-index everything below it (because it might have changed the Project). There are two ways to handle that.
1. Instead of using load(), we’ll use noTracking.load(), which tells RavenDB that when the referenced document changes, we should not re-index.
2. Provide the relevant scopes at the document level, like this:
I need to query over a time span, either known (start, end) or (start, $currentDate), and I need to be able to sort on them.
That might sound… vague, I know. A better way to explain this is that I have a list of people, and I need to sort them by their age. That’s trivial to do since I can sort by the birthday, right? The problem is that we include some historical data, so some people are deceased.
Basically, we want to be able to get the following data, sorted by age ascending:
Name                   Birthday   Death
Michael Stonebraker    1943       N/A
Sir Tim Berners-Lee    1955       N/A
Narges Mohammadi       1972       N/A
Sir Terry Pratchett    1948       2015
Agatha Christie        1890       1976
This doesn’t look hard, right? I mean, all you need to do is something like:
order by datediff( coalesce(Death, now()), Birthday )
Easy enough, and would work great if you have a small number of items to sort. What happens if we want to sort over 10M records?
Look at the manner in which we are ordering: it requires us to evaluate each and every record. That means we’ll have to scan through the entire list and sort it, which can be really expensive. And because we are sorting over a date (which changes), you can’t even get away with a computed field.
RavenDB will refuse to run queries that can only work with small amounts of data and would fail as the data grows. This is part of our philosophy, which says that things should Just Work. Of course, in this case it doesn’t work, so the question is how that aligns with our philosophy.
The idea is simple: if we cannot make a feature work in all cases, we reject it outright. That ensures your system is not susceptible to hidden traps. By explicitly rejecting it upfront, we make sure that you’ll build a good solution rather than something that will fail as your data size grows.
What is the appropriate behavior here, then? How can we make it work with RavenDB?
The key issue is that we want to be able to figure out what is the value we’ll sort on during the indexing stage. This is important because otherwise we’ll have to compute it across the entire dataset for each query. We can do that in RavenDB by exposing that value to the index.
We cannot just call DateTime.Today, however. That won’t work when the day rolls over, of course. So instead, we store that value in a document config/current-date, like so:
Once this is stored as a document, we can then write the following index:
from p in docs.People
let end = p.Death ?? LoadDocument("config/current-date","Config").Date
select new
{
Age = end - p.Birthday
}
And then query it using:
from index 'People/WithAge'
order by Age desc
That works beautifully, of course, until the next day. What happens then? Well, we’ll need to schedule an update to the config/current-date document to correct the date.
At that point, because an association was created between the configuration document and every document that loaded it, the indexing engine in RavenDB will go and re-index them. The idea is that at any given point in time, we have already computed the value and can run really quick queries that sort on it.
When you update the configuration document, it is a signal that we need to re-index the referencing documents. RavenDB is good at knowing how to do that on a streaming basis, so it won’t need to do a huge amount of work all at once.
You’ll also note that we only load the configuration document if we don’t have an end date. So the deceased people’s records will not be affected or require re-indexing.
In short, we can benefit from querying over the age without incurring query time costs and can defer those costs to background indexing time. The downside is that we need to set up a cron job to make it happen, but that isn’t too big a task, I think.
You can utilize similar setups for other scenarios where you need to query over changing values. The performance benefits here are enormous. And what is more interesting, even if you have a huge amount of data, this approach will just keep on ticking and deliver great results at very low latencies.
I got into an interesting discussion on LinkedIn about my previous post, talking about Code Rot. I was asked about Legacy Code, defined as code without tests, and how I reconcile code rot with having tests.
I started to reply there, but it really got out of hand and became its own post.
I read Working Effectively with Legacy Code for the first time in 2005 or thereabouts, I think. It left a massive impression on me and on the industry at large. The book is one of the reasons I started rigorously writing tests for my code; it got me interested in mocking and eventually led me to write Rhino Mocks.
It is ironic that the point of this post is that I disagree with this statement by Michael Feathers because of Rhino Mocks. Let’s start with numbers: the last commit to the Rhino Mocks repository was about a decade ago. It has just under 1,000 tests and code coverage that ranges between 95% and 100%.
I can modify this codebase with confidence, knowing that I will not break stuff unintentionally. The design of the code is very explicitly meant to aid in testing and the entire project was developed with a Test First mindset.
I haven’t touched the codebase in a decade (and it has been close to 15 years since I really delved into it). The code itself was written in .NET 1.1 around the 2006 timeframe. It literally predates generics in .NET.
It compiles and runs all tests when I try to run it, which is great. But it is still very much a legacy codebase.
It is a legacy codebase because changing this code is a big undertaking. This code will not run on modern systems. We need to address issues related to dynamic code generation between .NET Framework and .NET.
That in turn requires a high level of expertise and knowledge. I’m fairly certain that given enough time and effort, it is possible to do so. The problem is that this will now require me to reconstitute my understanding of the code.
The tests are going to be invaluable for actually making those changes, but the core issue is that a lot of knowledge has been lost. It will be a Project just to get it back to a normative state.
This scenario is pretty interesting because I am actually looking back at my own project. Thinking about having to do the same to a similar project from someone else’s code is an even bigger challenge.
Legacy code, in this context, means that there is a huge amount of effort required to start moving the project along. Note that if we had kept the knowledge and information within the same codebase, the same process would be far cheaper and easier.
Legacy code isn’t about the state of the codebase in my eyes, it is about the state of the team maintaining it. The team, their knowledge, and expertise, are far more important than the code itself.
An orphaned codebase, one that has no one to take care of, is a legacy project even if it has tests. Conversely, a project with no tests but with an actively knowledgeable team operating on it is not.
Note that I absolutely agree that tests are crucial regardless. The distinction that I make between legacy projects and non-legacy projects is whether we can deliver a change to the system.
Reminder: A codebase that isn’t being actively maintained and has no tests is the worst thing of all. If you are in that situation, go read Working Effectively with Legacy Code, it will be a lifesaver.
I need a feature with an ideal cost of X (time, materials, effort, etc.). A project with no tests but people familiar with it will be able to deliver it at a cost of 2-3X. A legacy project will need 10X or more. The second feature may still require 2X from the maintained project, but only 5X from the legacy system. However, that initial cost to get things started is the killer.
In other words, what matters here is the inertia, the ability to actually deliver updates to the system.
I’m currently deep in the process of modifying the internals of Voron, trying to eke more performance out of the system. I’m making great progress, but I’m also touching parts of the code that haven’t even been looked at for a long time.
In other words, I’m mucking about with the most stable and most critical portions of the storage engine. It’s a lot of fun, and I’m actually seeing some great results, but it is also nerve-wracking.
We have enough tests that I’ve great confidence I would catch any actual stability issues, but the drive back toward a fully green build has been a slog.
The process is straightforward:
Change something.
Verify that it works better than before.
Run the entire test suite (upward of 30K tests) to see if there are any breaks.
The last part can be frustrating because it takes a while to run this sort of test suite. That would be bad enough, but some of the changes I made were things like marking a piece of memory that used to be read/write as read-only. Now any write to that memory results in an access violation.
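For reference, the mechanism behind that kind of change on Linux is simply flipping the page protection on the mapped region; a tiny sketch (not Voron’s actual code) of what “used to be read/write, now read-only” means:

#include <stddef.h>
#include <sys/mman.h>

// Make a page-aligned region read-only; any subsequent write through it
// faults (an access violation / SIGSEGV) instead of silently changing memory.
static int make_read_only(void *page_aligned_ptr, size_t len) {
    return mprotect(page_aligned_ptr, len, PROT_READ);
}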
I fixed those in the code, of course, but we have a lot of tests, including some tests that intentionally corrupt data to verify that RavenDB behaves properly under those conditions.
One such test writes garbage to the RavenDB file through read-write memory. The idea is to verify that the checksum check fails on read and we abort early. Because that test directly modifies what is now read-only memory, it generates a crash due to a memory access violation. That doesn’t just result in a test failure; it takes the whole process down.
I’ve gotten pretty good at debugging those sorts of issues (--blame-crash is fantastic) and was able to knock quite a few of them down and get them fixed.
And then there was this test, which uses encryption-at-rest. That test started to fail after my changes, and I was pretty confused about exactly what was going on. When trying to read data from disk, it would follow up a pointer to an invalid location. That is not supposed to happen, obviously.
Looks like I have a little data corruption issue on my hands. The problem is that this shouldn’t be possible. Remember how we validate the checksum on read? When using encryption-at-rest, we are using a mechanism called AEAD (Authenticated Encryption with Associated Data). That means that in order to successfully decrypt a page of data from disk, it must have been cryptographically verified to be valid.
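To illustrate why that guarantee holds (using libsodium’s XChaCha20-Poly1305 purely as an example; this is not RavenDB’s code), decryption only succeeds if the authentication tag over the ciphertext verifies; flip a single byte and the decrypt call reports failure instead of returning garbage:

#include <sodium.h>
#include <stdio.h>

int main(void) {
    if (sodium_init() < 0) return 1;

    unsigned char key[crypto_aead_xchacha20poly1305_ietf_KEYBYTES];
    unsigned char nonce[crypto_aead_xchacha20poly1305_ietf_NPUBBYTES];
    crypto_aead_xchacha20poly1305_ietf_keygen(key);
    randombytes_buf(nonce, sizeof nonce);

    unsigned char page[64] = "some page data";
    unsigned char encrypted[sizeof(page) + crypto_aead_xchacha20poly1305_ietf_ABYTES];
    unsigned long long encrypted_len;
    crypto_aead_xchacha20poly1305_ietf_encrypt(encrypted, &encrypted_len,
        page, sizeof(page), NULL, 0, NULL, nonce, key);

    encrypted[10] ^= 0x01; // simulate on-disk corruption

    unsigned char decrypted[sizeof(page)];
    unsigned long long decrypted_len;
    if (crypto_aead_xchacha20poly1305_ietf_decrypt(decrypted, &decrypted_len, NULL,
            encrypted, encrypted_len, NULL, 0, nonce, key) != 0) {
        // Tampered data is rejected outright - a single encrypt/decrypt pair
        // never hands back "successfully decrypted" garbage.
        printf("decryption failed: data is not authentic\n");
    }
    return 0;
}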
My test results showed, pretty conclusively, that I was generating valid data and then encrypting it. The next stage was to decrypt the data (verifying that it was valid), at which point I ended up with complete garbage.
RavenDB trusts that since the data was properly decrypted, it is valid and tries to use it. Because the data is garbage, that leads to… excitement. Once I realized what was going on, I was really confused. I’m pretty sure that I didn’t break 256-bit encryption, but I had a very clear chain of steps that led to valid data being decrypted (successfully!) to garbage.
It was also quite frustrating to track because any small-stage test that I wrote would return the expected results. It was only when I ran the entire system and stressed it that I got this weird scenario.
I started practicing my Fields Medal acceptance speech while digging deeper. Something here had to be wrong. It took me a while to figure out what was going on, but eventually I tracked it down to registering for the TransactionCommit event when we open a new file.
The idea is that when we commit the transaction, we’ll encrypt all the data buffers and then write them to the file. We register for an event to handle that, and we used to do that on a per-file basis. My changes, among other things, moved that logic to apply globally.
As long as we were writing to a single file, everything just worked. When we had enough workload to need a second file, we would encrypt the data twice and then write it to the file. Upon decryption, we would successfully decrypt the outer layer but end up with data that was still encrypted (looking like random fluff).
The fix was simply moving the event registration to the transaction level, not the file level. I committed my changes and went back to the unexciting life of bug-fixing, rather than encryption-breaking and math-defying hacks.