Fast Search Indexing

Here at DoneSafe, we let users search for other users when filling out forms, to select things like who reported an incident or who was involved. We had received complaints that this search was slow and did not support partial matching. In this post we will explore a couple of different strategies for reimplementing it, ending with a GIN index built on Postgres trigrams. I also show how to do this in a multi-tenant environment using Apartment.

Setup

I wanted to test some different strategies locally, so I went ahead and created around 100k users on my machine to try to replicate some real production level stress.

We currently store names as first_name and last_name. We want to support the ability for someone to search either, or both.

100000.times do
  User.create(email: Faker::Internet.email,
              first_name: Faker::Name.first_name,
              last_name: Faker::Name.last_name,
              password: Faker::Internet.password
             )
end

This will take a while. It could be done much faster by inserting directly into the database, but I decided to set it running and go to lunch instead of optimizing it. Running it from a few console windows at the same time also speeds it up.
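For reference, one way to speed this up (a sketch, not what I actually ran) is to batch rows into a single multi-row INSERT and execute it over the ActiveRecord connection. Building the statement is plain Ruby; the values below are stand-ins for Faker output:

```ruby
# Sketch: batch seed rows into one multi-row INSERT instead of
# thousands of individual creates. Placeholder values stand in for
# Faker output; the resulting SQL would be run via the AR connection.
rows = (1..1000).map do |i|
  "('user#{i}@example.com', 'First#{i}', 'Last#{i}')"
end
insert_sql =
  "INSERT INTO users (email, first_name, last_name) VALUES #{rows.join(', ')}"
# ActiveRecord::Base.connection.execute(insert_sql)
```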

Current Code

The current implementation matches against either the first name or the last name and then concatenates the results. This means that if you search 'John', it will bring back both 'John Smith' and 'Bob Johnson'. Searching 'John Smith', however, returns nothing, since that string is neither a first name nor a last name; it is both.

first_name_matches = User.where("first_name ILIKE ?", "%#{params[:term]}%").limit(20)
last_name_matches = User.where("last_name ILIKE ?", "%#{params[:term]}%").limit(20)
@people = first_name_matches.concat(last_name_matches)
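To make the failure mode concrete, here is a toy plain-Ruby version of the two matching strategies (illustrative only; the real comparison happens in SQL via ILIKE):

```ruby
# Toy illustration of the matching logic (plain Ruby, not the app code).
users = [
  { first_name: "John", last_name: "Smith" },
  { first_name: "Bob",  last_name: "Johnson" },
]
term = "john smith"

# Matching each column separately, as the current code does:
per_column = users.select do |u|
  u[:first_name].downcase.include?(term) || u[:last_name].downcase.include?(term)
end

# Matching against the concatenated name, which is what we want:
full_name = users.select do |u|
  "#{u[:first_name]} #{u[:last_name]}".downcase.include?(term)
end

per_column.size  # => 0, neither column alone contains "john smith"
full_name.size   # => 1, the concatenated name does
```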
Naive Implementation

We can get the behavior we want without making any changes to the data or database by simply searching on the concatenated field via SQL. This change looks something like this:

@people = User.where("first_name || ' ' || last_name ILIKE ?", "%#{params[:term]}%").limit(10)

The cost of this query with our 100k users is around 100ms. This is not terrible, but it is noticeable in the UX.

[local] [email protected]_development=# EXPLAIN ANALYZE select * from users where first_name || ' ' || last_name ilike '%toney%';
                                               QUERY PLAN                                                
---------------------------------------------------------------------------------------------------------
 Seq Scan on users  (cost=0.00..7360.63 rows=160 width=1046) (actual time=4.188..92.154 rows=50 loops=1)
   Filter: ((((first_name)::text || ' '::text) || (last_name)::text) ~~* '%toney%'::text)
   Rows Removed by Filter: 100043
 Planning time: 0.089 ms
 Execution time: 92.198 ms
(5 rows)
Calculated Field

Since we are going to be querying on this full name, we may as well store it as a column in the database. Let's create a migration to add full_name to our users table:

class AddFullNameToUsers < ActiveRecord::Migration
  def up
    add_column :users, :full_name, :string

    execute "UPDATE users SET full_name = first_name || ' ' || last_name"
  end

  def down
    remove_column :users, :full_name
  end
end
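One thing to watch: the stored column goes stale if a user's name changes after this backfill runs. A sketch of keeping it in sync is below; the callback is my addition, not code from our app. In the real Rails model the method would be wired up with `before_save :sync_full_name`:

```ruby
# Sketch: keep full_name in sync with its source columns. In a Rails
# model this would be registered as a before_save callback
# (before_save :sync_full_name); shown here as plain Ruby.
class User
  attr_accessor :first_name, :last_name, :full_name

  def sync_full_name
    self.full_name = [first_name, last_name].compact.join(" ")
  end
end

user = User.new
user.first_name = "John"
user.last_name  = "Smith"
user.sync_full_name
user.full_name  # => "John Smith"
```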

Let's look at our now-simplified query:

[local] [email protected]_development=# EXPLAIN ANALYZE SELECT * FROM users WHERE full_name ilike '%Toney%';
                                               QUERY PLAN                                               
--------------------------------------------------------------------------------------------------------
 Seq Scan on users  (cost=0.00..6860.16 rows=10 width=1046) (actual time=4.013..91.706 rows=50 loops=1)
   Filter: ((full_name)::text ~~* '%Toney%'::text)
   Rows Removed by Filter: 100043
 Planning time: 0.272 ms
 Execution time: 91.742 ms
(5 rows)

Now we can think about indexing this field to speed up the search. A standard Rails (btree) index will not help here, because we are wildcard matching on both sides of the search term. It would only help if the search were left-anchored, matching just to the right, for instance 'Jack%'.
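As a rough analogy in plain Ruby (not Postgres internals): a btree keeps values sorted, so a left-anchored prefix maps to a contiguous range that binary search can find, while an unanchored substring gives the ordering nothing to work with:

```ruby
# Sorted names stand in for a btree index.
names = ["Bob Johnson", "Jack Smith", "Jane Doe", "John Smith"].sort

# Left-anchored search ('Ja%'): binary-search to the start of the
# range, then walk forward while the prefix still matches.
prefix = "Ja"
start = names.bsearch_index { |n| n >= prefix }
prefix_matches = names[start..].take_while { |n| n.start_with?(prefix) }

# Unanchored search ('%Smith%'): the ordering is no help; every
# entry must be examined, which is what a sequential scan does.
substring_matches = names.select { |n| n.include?("Smith") }
```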

Smart Indexing

We want to index our new full_name field in a way that speeds up this search. For this we are going to use an extension called pg_trgm. This extension adds the ability to split a column into trigrams, which looks a little something like this:

CREATE EXTENSION pg_trgm;
CREATE EXTENSION
Time: 8.400 ms

[local] [email protected]_development=# select show_trgm('hello');
            show_trgm            
---------------------------------
 {"  h"," he",ell,hel,llo,"lo "}
(1 row)
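To build intuition for what the index stores, here is a rough plain-Ruby approximation of that extraction: the string is lowercased, padded with two leading spaces and one trailing space, then sliced into overlapping three-character windows (the real extension also normalizes non-word characters, which this sketch ignores):

```ruby
# Approximation of pg_trgm's trigram extraction, for illustration only.
def trigrams(word)
  padded = "  #{word.downcase} "
  (0..padded.length - 3).map { |i| padded[i, 3] }.uniq.sort
end

trigrams("hello")
# => ["  h", " he", "ell", "hel", "llo", "lo "]
```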

We can now create a GIN index on full_name using these trigrams.

CREATE INDEX CONCURRENTLY index_users_on_full_name_trigram
ON users
USING gin (full_name gin_trgm_ops);

And compare the same query that we were using above:

[local] [email protected]_development=# EXPLAIN ANALYZE SELECT * FROM users WHERE full_name ilike '%Toney%';
                                                                 QUERY PLAN                                                                 
--------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on users  (cost=28.08..66.94 rows=10 width=1046) (actual time=0.211..0.359 rows=50 loops=1)
   Recheck Cond: ((full_name)::text ~~* '%Toney%'::text)
   Heap Blocks: exact=50
   ->  Bitmap Index Scan on index_users_on_full_name_trigram  (cost=0.00..28.07 rows=10 width=0) (actual time=0.193..0.193 rows=50 loops=1)
         Index Cond: ((full_name)::text ~~* '%Toney%'::text)
 Planning time: 0.441 ms
 Execution time: 0.409 ms
(7 rows)

That's ~90ms down to well under 1ms on a wildcard search. Quite an improvement.

Putting it into the Application

DoneSafe uses Apartment for multi-tenancy. If we are going to add a Postgres extension like pg_trgm to our database, we need to make sure all of our schemas can use it. To do this we create a shared schema that is always in our search path, so that no matter which tenant schema we are in, the search path will find pg_trgm. Let's make a Rake task for this, as Apartment suggests.

# lib/tasks/db_enhancements.rake

####### Important information ####################
# This file is used to setup a shared extensions #
# within a dedicated schema. This gives us the   #
# advantage of only needing to enable extensions #
# in one place. We use PG TRGM for full_name     #
# search indexing.                               #
#                                                #
# This task should be run AFTER db:create but    #
# BEFORE db:migrate.                             #
##################################################

namespace :db do
  desc 'Also create shared_extensions Schema'
  task :extensions => :environment  do
    # Create Schema
    ActiveRecord::Base.connection.execute 'CREATE SCHEMA IF NOT EXISTS shared_extensions;'
    # Enable PG_TRGM
    ActiveRecord::Base.connection.execute 'CREATE EXTENSION IF NOT EXISTS PG_TRGM SCHEMA shared_extensions;'
  end
end

Rake::Task["db:create"].enhance do
  Rake::Task["db:extensions"].invoke
end

Rake::Task["db:test:purge"].enhance do
  Rake::Task["db:extensions"].invoke
end

We run this before running our migrations so that the database has pg_trgm before it tries to create an index using it.

Changing your Postgres search path must be done in both your Apartment config and your database.yml.

# config/database.yml
.
.
common:
  schema_search_path: "public, shared_extensions"
.
.

# config/initializers/apartment.rb
.
config.persistent_schemas = %w{ shared_extensions }
.

We now also adjust our full_name migration to create this new index. Note that we use a plain CREATE INDEX here rather than CREATE INDEX CONCURRENTLY, because a concurrent build cannot run inside the transaction that ActiveRecord wraps around each migration (you would need disable_ddl_transaction! for that).

class AddFullNameToUsers < ActiveRecord::Migration
  def up
    add_column :users, :full_name, :string

    execute "UPDATE users SET full_name = first_name || ' ' || last_name"
    execute "CREATE INDEX index_users_on_full_name_trigram ON users USING gin (full_name gin_trgm_ops)"
  end

  def down
    # Drop the index before the column; removing the column first would
    # drop the index implicitly and make the DROP INDEX fail.
    execute "DROP INDEX index_users_on_full_name_trigram"
    remove_column :users, :full_name
  end
end

You also need to have your database structure dumped as SQL instead of Ruby, since schema.rb cannot represent things like the extension schema or the trigram operator class. This means you will have a structure.sql file instead of a schema.rb. Set this in your application config with:

config.active_record.schema_format = :sql

and then run rake db:structure:dump.

The last thing is to add a call to our new rake task to your deploy script, prior to running migrations. This may seem like a lot of work for a small speed-up, but the feature is now blazing fast, and we can add this type of index elsewhere in the application without repeating the multi-tenancy setup. It will also make it easier to use other extensions, such as HSTORE, in the future.

Inspiration / Thanks

I implemented the trigram strategy based on this post by Yorick Peterse at GitLab: Blog

- Jack
