
Learn through the super-clean Baeldung Pro experience:
>> Membership and Baeldung Pro.
No ads, dark-mode and 6 months free of IntelliJ Idea Ultimate to start with.
Last updated: March 18, 2024
In this tutorial, we’ll explore methods to create a random sample of a fixed size from Scala List. For example, if we need to select some five arbitrary elements from the list: List[Int] = List(0,1,2,3,4,5,6,7,8,9), the result might resemble List(8, 0, 3, 7, 5).
In Scala, we often adhere to the functional style when working with collections, employing combinators like map and flatMap. Essentially, we traverse the collection and update each value. However, when it comes to accessing some random elements, traversing the list for updates may perform pretty weakly.
For the purpose of illustration, we can address the problem using tail recursion:
def getRandomSampleRec[T](list: List[T], size: Int): List[T] = {
@tailrec
def rec(xs: List[T], acc: List[T]): List[T] = {
if (acc.size == size) {
acc
} else {
val index = Random.nextInt(xs.size)
val (left, right) = xs.splitAt(index)
val (xsUpd, next) = if (right.nonEmpty) {
(left ::: right.tail, right.head)
} else {
(left.dropRight(1), left.tail.last)
}
rec(xsUpd, next :: acc)
}
}
if (size == 0) {
List.empty[T]
} else if (size > list.size) {
list
} else {
rec(list, List.empty[T])
}
}
Let’s review the key steps of the solution:
It’s important to note that here, we impose a significant burden on performance. This is due to the fact that splitAt and list concatenation (:::) both have O(N) time complexity and, since we repeat these operations size times, the overall time complexity becomes O(size*N). Utilizing this method would be appropriate only for small samples.
Among other purely functional solutions, we might consider the following approach:
def getRandomSampleZip[T](list: List[T], size: Int): List[T] =
list
.map(elem => (Random.nextInt(), elem))
.sortBy(_._1)
.map(_._2)
.take(size)
Let’s break down the process:
By following this approach, the time complexity is O(N*log(N)), which corresponds to the complexity of the sorting operation.
We can leverage a standard Scala library method — shuffle from scala.util.Random — to mix elements of the List. Then, we take only the desired number of elements:
def getRandomSampleShuffle[T](list: List[T], size: Int): List[T] =
Random.shuffle(list).take(size)
Random.shuffle uses the mutable.ArrayBuffer data structure under the hood, which enhances the performance of element access by index.
This approach boasts a time complexity of O(N) since the shuffle operation traverses the list only once. In the worst case, the list will be traversed twice, if the size of the randomized sample matches the size of the initial list (as seen when take is called). Furthermore, this solution is notably elegant.
For testing purposes, we’ll extract samples from a sorted List of elements ranging from 0 to 99. Throughout the testing phase, our objectives include ensuring that the resulting samples’ elements are both shuffled (and hence not sorted) and non-repeating. Additionally, we must confirm that the sample sizes are accurate. As an appendix, we may also ensure which method performs the best:
"RandomFixedSizeSample" should {
"create a random sample out of the initial List" in {
val list = List.range(0, 100)
val sampleSize = 10
val list_0 = RandomFixedSizeSample.getRandomSampleRec(list, sampleSize)
val list_1 = RandomFixedSizeSample.getRandomSampleZip(list, sampleSize)
val list_2 = RandomFixedSizeSample.getRandomSampleShuffle(list, sampleSize)
list_0.size shouldBe sampleSize
list_1.size shouldBe sampleSize
list_2.size shouldBe sampleSize
list_0.toSet.size shouldBe list_0.size
list_1.toSet.size shouldBe list_1.size
list_2.toSet.size shouldBe list_2.size
isSorted(list_0) shouldBe false
isSorted(list_1) shouldBe false
isSorted(list_2) shouldBe false
}
"ensure getRandomSampleShuffle is the most performant, then goes getRandomSampleZip and then getRandomSampleRec" in {
val list = List.range(0, 10_000)
val sampleSize = 100
val start_0 = System.nanoTime()
RandomFixedSizeSample.getRandomSampleRec(list, sampleSize)
val end_0 = System.nanoTime()
val duration_0 = end_0 - start_0
val start_1 = System.nanoTime()
RandomFixedSizeSample.getRandomSampleZip(list, sampleSize)
val end_1 = System.nanoTime()
val duration_1 = end_1 - start_1
val start_2 = System.nanoTime()
RandomFixedSizeSample.getRandomSampleShuffle(list, sampleSize)
val end_2 = System.nanoTime()
val duration_2 = end_2 - start_2
duration_0 should be > duration_1
duration_1 should be > duration_2
}
As we can see, leveraging scala.util.Random method shuffle leads to the best performance and conciseness. Additionally, we’ve observed that zipping collections with random numbers is also pretty powerful.