Performance

The following benchmarks show the performance of sift. While speed is not everything, it makes quite a difference whether you get your results after 2 seconds or 2 minutes, especially when you grep through many files over and over (e.g. when searching large source code repositories or log files).
These results were achieved even though sift introduces features that other tools do not have, such as conditions and multiline matching.

In this comparison, all tools were configured to search the complete test data. Of course, sift can also be configured to search only specific paths, files, etc., but the aim was to be fast while searching everywhere.

All searches were performed with the complete data files cached by the operating system. Three runs per test were performed and the best result was taken.
The weblog searches were done on a large server, while the rest was done on a desktop system.
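
The "best of three runs" procedure can be sketched as a small shell helper. This is only an illustration, not the script used for the results on this page: the name best_of is made up, and it assumes GNU date (for the %N nanosecond format).

```shell
# best_of RUNS CMD [ARGS...]
# Runs CMD the given number of times, discards its output and prints the
# fastest wall-clock time in seconds. The body is a subshell, so the helper
# variables do not leak into the calling shell.
best_of() (
  runs=$1
  shift
  best=""
  i=0
  while [ "$i" -lt "$runs" ]; do
    start=$(date +%s%N)
    # Search tools exit non-zero when nothing matches; ignore the status.
    "$@" > /dev/null 2>&1 || true
    end=$(date +%s%N)
    elapsed=$(($end - $start))
    if [ -z "$best" ] || [ "$elapsed" -lt "$best" ]; then
      best=$elapsed
    fi
    i=$(($i + 1))
  done
  # Convert nanoseconds to seconds with millisecond precision.
  printf '%d.%03ds\n' "$(($best / 1000000000))" "$(($best % 1000000000 / 1000000))"
)
```

For example, `best_of 3 grep 'IntWebApp.*ParamName' access.log` would time three runs of grep over a (hypothetical) access.log and print the fastest one. Taking the minimum rather than the average reduces the influence of unrelated background activity on the measurement.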


Benchmark Results

For each benchmark, the time per tool is listed together with the factor relative to sift (sift = 1x). Times in parentheses are estimates (single search x 10, see below). A "-" means that no result is given for that tool.


Web log files search

This search simulated searching for a specific pattern in web logs.
The search was performed over 35GB data, split over 32 files, using the pattern 'IntWebApp.*ParamName'. The logs were synthesized from real logs, and there was one valid match to find ('IntWebApp' was part of the logged URL path, while 'ParamName' was a query parameter).

  Tool   Time       Factor
  grep   23.630s    40.81x
  ack    226.154s   390.59x
  ag     4.665s     8.06x
  pt     222.487s   384.26x
  sift   0.579s     1x


Web log files search for 10 strings in parallel

In this search, the same data as above was used, but the search was done for 10 static strings listed in a file. Some tools do not support searching for multiple patterns in parallel; for those, the result for a single search x 10 was taken as the result.

  Tool   Time       Factor
  grep   148.497s   27.30x
  ack    (~2261s)   415.70x
  ag     (~46s)     8.46x
  pt     (~2224s)   408.90x
  sift   5.439s     1x


Linux source code

Listing all exported crypto symbols with line numbers - searching for "EXPORT_SYMBOL_GPL.*crypto" in the Linux kernel version 3.18.2 (637 MB).

  Tool   Time       Factor
  grep   0.747s     1.75x
  ack    25.929s    60.87x
  ag     1.140s     2.67x
  pt     17.840s    41.88x
  sift   0.426s     1x


Wordlist search

Searching a large wordlist (1.8 GB, used in password cracking attempts) for all word variations containing 'qwerty' (returning 2722 results).

  Tool   Time       Factor
  grep   1.285s     1.74x
  ack    155.801s   211.11x
  ag     6.484s     8.79x
  pt     24.651s    33.40x
  sift   0.738s     1x


Userlist search (ignore case)

Searching a list of usernames (125 MB) for all name variations containing 'grep' (ignore case, returning 121 results).

  Tool   Time       Factor
  grep   0.381s     1.36x
  ack    10.050s    35.89x
  ag     0.497s     1.78x
  pt     7.485s     26.73x
  sift   0.280s     1x


Log search

This search was done over DNS logs (228 files, 15 MB). Some of the log files were already gzip'ed and the newest files were not.
For grep, two calls had to be made: grep for the normal files and zgrep for the gzip'ed files.

  Tool   Time       Factor
  grep   1.465s     6.37x
  ack    -          -
  ag     0.485s     2.11x
  pt     -          -
  sift   0.230s     1x
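
The factor given next to each time is the tool's time divided by sift's time for the same benchmark. A minimal sketch of that calculation follows; the helper name factor is made up for illustration, and the numbers are taken from the web log files search.

```shell
# factor TOOL_SECONDS SIFT_SECONDS
# Prints how many times slower a tool was than sift, rounded to two decimals.
factor() {
  awk -v t="$1" -v s="$2" 'BEGIN { printf "%.2fx\n", t / s }'
}

factor 23.630 0.579   # grep, web log files search: prints 40.81x
factor 4.665 0.579    # ag, web log files search: prints 8.06x
```

The estimated times in parentheses follow the same idea: they appear to be ten times the single-pattern time of the respective tool (e.g. 10 x 4.665s for ag gives roughly 46s).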

Notes on the Benchmarks

The purpose of these benchmarks is to show that sift can compete with other tools concerning performance and that there are several use cases where sift is ahead of other tools. Of course these benchmarks show use cases in which sift is especially good - there will be cases where sift is not the perfect tool for the job.

As there have been quite a few accusations that these tests are made up, I searched for useful publicly available test data and created some scripts so that everybody can reproduce the test results. I cannot share the test data I used, and I think using publicly available data ensures that I did not make up test data specifically tailored so that sift performs better than other tools.

The benchmarks in which sift is much faster than ag (the silver searcher) received the most criticism, so here are the steps to reproduce those test results.


Wordlist Search

This one is rather simple to reproduce: you just need a large wordlist. You can download one here (crackstation-human-only.txt.gz) via BitTorrent. Unpacked, that list has a size of 684 MB.

Searching for variations containing 'qwerty' yields the following results:

  $ time ag -s --no-numbers qwerty crackstation-human-only.txt | wc
  2993    2995   35555
  
  real 0m2.488s
  user 0m2.422s
  sys  0m0.061s
  
  $ time sift qwerty crackstation-human-only.txt | wc
  2993    2995   35555
  
  real  0m0.327s
  user  0m0.159s
  sys 0m0.230s

sift takes 0.327 seconds while ag takes 2.488 seconds. These results were taken on a system with an AMD Phenom II 1100T processor with all data cached in RAM. Only one CPU core was used as only one file was searched.
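
Timings like these only compare search speed if the file is already in the operating system's page cache. One way to make sure of that is to read the file once before measuring. The following is a sketch under the assumption that the machine has enough free RAM to hold the file; the helper name warm_cache is made up for illustration.

```shell
# warm_cache FILE...
# Reads every given file once so that later runs can be served from the
# page cache, and prints the total number of bytes read. The data is piped
# through cat so that it is actually read and not just stat'ed.
warm_cache() {
  cat -- "$@" | wc -c | tr -d ' '
}
```

Running `warm_cache crackstation-human-only.txt` before the `time` commands above should keep the first measured run from being slowed down by disk reads.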


Web Log Files Search

As stated above, this benchmark was performed on a large server with many CPUs and enough RAM to cache the whole test data. This was done to show sift's ability to search concurrently and to scale linearly with the number of available CPUs.

A typical use case for this is the forensic analysis of log data, which is often done on large servers to achieve maximum performance.

I found some public web logs here - the sample for July contains about 200 MB of weblogs.

The test scripts duplicate this data in 32 files (each about 1GB in size) and add one line (that is searched for) to one log file (randomly selected).
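
The data preparation described above can be sketched as follows. This is not the actual prepare-data.sh, only an illustration of the idea: the function name, the output file names (weblog-NN.log) and the needle line in the usage example are assumptions.

```shell
# make_testdata SAMPLE OUTDIR FILES COPIES NEEDLE
# Writes FILES log files into OUTDIR, each consisting of COPIES concatenated
# copies of SAMPLE, then appends NEEDLE as one extra line to one randomly
# chosen file. Prints the name of the file that received the needle.
# The body is a subshell, so helper variables do not leak to the caller.
make_testdata() (
  sample=$1
  outdir=$2
  files=$3
  copies=$4
  needle=$5
  mkdir -p "$outdir" || exit 1
  i=1
  while [ "$i" -le "$files" ]; do
    out=$(printf '%s/weblog-%02d.log' "$outdir" "$i")
    : > "$out"
    c=0
    while [ "$c" -lt "$copies" ]; do
      cat "$sample" >> "$out"
      c=$(($c + 1))
    done
    i=$(($i + 1))
  done
  # Pick a random file number in 1..FILES (awk's rand(), seeded from the clock).
  pick=$(awk -v n="$files" 'BEGIN { srand(); print int(rand() * n) + 1 }')
  target=$(printf '%s/weblog-%02d.log' "$outdir" "$pick")
  printf '%s\n' "$needle" >> "$target"
  printf '%s\n' "$target"
)
```

With a 200 MB sample, a call such as `make_testdata sample.log /mnt/testdata 32 5 'GET /IntWebApp/index?ParamName=1'` would produce 32 files of about 1GB each, exactly one of which contains a line matching 'IntWebApp.*ParamName'.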

The benchmarks can be reproduced using Amazon AWS: create a new c3.8xlarge instance (a c3.4xlarge instance should yield similar results) and select "Ubuntu Server 14.04 LTS (HVM), SSD Volume Type - ami-d05e75b8" as image. The following commands then run the benchmarks.

  cd /tmp/

  wget https://sift-tool.org/downloads/benchmark/{install,prepare-data,test}.sh
  chmod +x *.sh

  ./install.sh

  sudo mkdir /mnt/testdata
  sudo chown ubuntu /mnt/testdata

  ./prepare-data.sh /mnt/testdata/

  cd /mnt/testdata/
  /tmp/test.sh

Video Showing the Web Log Files Search

I also created a video to show what the output looks like. The creation of the 32 test files takes some time, so you might want to skip that. The benchmarks of pt and ack are not shown as they take several minutes.