With F#, the list often feels like the default choice of data structure. It is immutable and hence easy to reason about, however its use can come at a great cost. If you are using lists to process large amounts of data, then a lot of time will be spent creating objects and garbage collecting. When I stress tested lists with F#, it became clear that this can cause a major synchronization issue when parallel processing. In fact, I was often seeing programs written in parallel which were effectively just context switching between threads rather than running in parallel.
Keep the immutable lists but use a serialization strategy to bring data in and out of a program that runs as a single processes. Then use another program to distribute that data to each program and run them in parallel.
Switch to mutable data structures to reduce the synchronization problems and time spent creating objects and garbage collecting.
My preference is for immutable data structures, however serializing data in and out can be slow depending on what is being processed and this ugly hack removed much of the beauty and transparency of using immutable data structures. I therefore decided to investigate the effect of switching to mutable data structures to see what a difference this would make. The difference was huge.
An Example Program Using Immutable Lists
The script below (program-noclass-list.fsx) is a very basic data categorizer which relies on a field having a certain value as a rule to determine if this will predict a category.
I made two main improvements, both involving moving to arrays. The programs can be seen in full at the end of this article and the benchmarks in the next section.
First I switched all the lists to arrays. This would reduce the amount of memory allocation and garbage collection and hence synchronization issues. Although arrays are mutable, I used them in an immutable way.
Secondly I made the record field for the arrays mutable and again reduced the time creating objects and garbage collecting.
The switch to mutable arrays made a big difference, but I wanted to regain some control of the mutability and hence created a class to improve access control.
You can see below that the biggest difference was made by switching from lists to arrays and that still more could be done by moving to mutable arrays.
For those interested here are the benchmarks for the same programs run on a single thread (by changing the numSplits line to equal 1). This demonstrates the problem with program-noclass-list.fsx as it actually runs quicker with a single thread than it does in parallel. It also shows the problem with lists in general where performance is an issue.
It is important to pick the correct data structures for the job at hand. As Wirth said “Algorithms + Data Structures = Programs”. While lists are a great default data structure in F#, where performance is an issue it can be seen how important it is to look beyond this, particularly when using multiple threads.