For me, there are few things more irritating than over-engineering. Wherever possible I try to avoid using anything but the bare minimum necessary; making my code, my car, my life as easy to repair as possible.
Progress isn’t made by early risers. It’s made by lazy men trying to find easier ways to do something.
– Robert A. Heinlein
This is especially true when discussing databases. When text search comes up, the first thing that comes to mind is often Elasticsearch – indeed it’s a great product and works well, but it can be a pain to set up and maintain. PostgreSQL, in contrast, is dead simple to set up, runs anywhere, is easy to maintain, and is probably “good enough”.
What to Expect
This article discusses full-text search in PostgreSQL. It’ll walk through several methods, analyzing and explaining each, and finally propose a performant solution.
For demonstration purposes, I’ll be using a subset of the database I keep locally to test HNProfile.com and RedditProfile.com, which has right around 20 million comments in the database.
Why Full-Text Search in PostgreSQL
It’s easy to set up and maintain, and most companies already have an effective pattern for deploying it.
I run a company called Metacortex, where all of our products are focused on understanding how people think. Almost exclusively, our processed data[1] is stored in PostgreSQL databases. Our website ProjectPiglet.com, for instance, uses it exclusively – even though we process tens of thousands of comments daily, with millions of database inserts & reads.
It’s often said that there are better options for full-text search, and technically that’s true! However, pragmatism is often an engineer’s best friend, and PostgreSQL is easy for us – as the option is almost always available.
[1] Raw data is stored in S3, as it’s way too large for PostgreSQL
The Dataset
Our dataset is a subset of 20 million comments I have for testing HNProfile.com and RedditProfile.com. We will boil that down further to around 5.5 million comments when we search between 2018-01-01 and 2018-07-07.
The table, called “comments”, is in the following form:
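A minimal sketch of the schema, assuming only the columns referenced in this article (the story_title and created_at names are illustrative):

CREATE TABLE comments (
    id           BIGSERIAL PRIMARY KEY,
    story_url    TEXT,
    story_title  TEXT,      -- assumed; used later for weighted search
    comment_text TEXT,
    created_at   TIMESTAMP  -- assumed; used to filter by date range
);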
Initially, we can assume there are no indexes. However, we will build them.
Test Queries
There will be two tests:
- Speed test – “EXPLAIN ANALYZE” added before the query
- Accuracy test – Ensure between the dates ‘2018-07-01’ and ‘2018-07-07’ they rank the same stories
To measure accuracy: we will search the comments for the term ‘google’, group by the story_url, and count how many times the term ‘google’ is mentioned in the comments. The goal is to ensure the stories at the top are related to ‘google’ – we can assume the comments relate to them.
Hardware used:
- Postgres 10.3 (default config)
- CPU: AMD Ryzen 7 1800x eight-core processor
- RAM: 32 GB
During testing, PostgreSQL never broke 2 GB of RAM or 10% CPU utilization.
Super Slow & Reasonably Accurate Full-Text Search
The first method of full-text search in PostgreSQL we will discuss is probably the slowest possible way to do it. It may work on small datasets (< 1,000 entries), but for us it really won’t do. It takes around two and a half minutes to search the database…
The first method uses tsvectors, which are built from lexemes, or normalized words. Explained another way, the more similar a word looks, the higher the “match” score (i.e. the higher the rank) – this is called “fuzzy matching”. This improves search results but increases the time of the search.
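For example, to_tsvector reduces different forms of a word to the same lexeme:

SELECT to_tsvector('english', 'searching searched searches');
-- 'search':1,2,3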
In our case, it takes 152 seconds to search all the text of our 5.5 million comments:
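A sketch of what that query looks like, computing the tsvector on the fly (column names as assumed above):

EXPLAIN ANALYZE
SELECT story_url, count(*) AS mentions  -- matching comments per story
FROM comments
WHERE created_at BETWEEN '2018-01-01' AND '2018-07-07'
  AND to_tsvector('english', comment_text) @@ to_tsquery('english', 'google')
GROUP BY story_url
ORDER BY mentions DESC;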
This would be insanely slow in an application, but it’s probably pretty accurate in terms of identifying the term “google” being used in the comments (the results being related to Google).
Slow & Semi-Accurate Full-Text Search
The second method is less accurate, but it’s probably “good enough” and provides results 3x faster, at 42 seconds. This method is essentially a regex search through the comment text, which works well enough for a single one-off query – but still not good for an application at scale.
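Again, a sketch of the query, this time with a case-insensitive regex match instead of a tsvector:

EXPLAIN ANALYZE
SELECT story_url, count(*) AS mentions
FROM comments
WHERE created_at BETWEEN '2018-01-01' AND '2018-07-07'
  AND comment_text ~* 'google'  -- case-insensitive regex match
GROUP BY story_url
ORDER BY mentions DESC;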
The accuracy – in terms of counting how many times “google” is mentioned in the comments for each of these stories – is relatively low (compared to our previous slow, but accurate results).
That’s
- 3x faster than the previous result
- Probably 50% as accurate
Building Fast & Accurate Full-Text Search
Now, we’ll walk through how to make this fast enough for a web app. The trick may be counterintuitive: it is to use the first method. Essentially, we need to keep the accuracy from above, while ensuring the query runs in under 2 seconds (as opposed to 150+ seconds).
To do this, we can use a GIN index on “comment_text”, which will allow us to search the index much faster. However, rather than putting it directly on the text field, we’re going to create a new column and add an index to it:
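A sketch of the column and index (the index name is illustrative; the column name matches what’s used below):

ALTER TABLE comments ADD COLUMN tsv_comment_text tsvector;

CREATE INDEX idx_comments_tsv ON comments USING GIN (tsv_comment_text);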
This ensures that it is separate from the raw text and allows us to weight the search queries. This can be important if we’d like to (as we do in this example) return all the stories in which ‘google’ has been discussed in our dataset (even if ‘google’ isn’t mentioned explicitly in the comments – if it’s in the title, we can assume it’s being discussed).
Thus we fill our new column with the tsvector with desired weighting:
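A sketch, assuming the story_title column from above and weighting the title (‘A’) above the comment body (‘B’):

UPDATE comments
SET tsv_comment_text =
    setweight(to_tsvector('english', coalesce(story_title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(comment_text, '')), 'B');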
Finally, we create a function, which triggers every time a new comment is added. This is to ensure the proper weighting is always added to the “tsv_comment_text” column:
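A sketch of the trigger function and trigger (the names are illustrative; EXECUTE PROCEDURE is the PostgreSQL 10 syntax):

CREATE OR REPLACE FUNCTION comments_tsv_update() RETURNS trigger AS $$
BEGIN
    -- mirror the weighting used in the backfill above
    NEW.tsv_comment_text :=
        setweight(to_tsvector('english', coalesce(NEW.story_title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(NEW.comment_text, '')), 'B');
    RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER tsv_comment_text_update
    BEFORE INSERT OR UPDATE ON comments
    FOR EACH ROW EXECUTE PROCEDURE comments_tsv_update();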
Fast Full-Text Search in PostgreSQL
Overall, the results speak for themselves. Much higher accuracy, at a speed we could live with:
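The final query is the same shape as the first, but matches against the precomputed, GIN-indexed column:

EXPLAIN ANALYZE
SELECT story_url, count(*) AS mentions
FROM comments
WHERE created_at BETWEEN '2018-01-01' AND '2018-07-07'
  AND tsv_comment_text @@ to_tsquery('english', 'google')
GROUP BY story_url
ORDER BY mentions DESC;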
That makes the results
- 60x faster than the first results
- 16x faster than our second attempt
- Probably the most accurate results
That’s a speed of 2,067,669 comments searched per second. For reference, the same machine (which ran these queries) can also insert around 10,000 comments per second into the database.
In other words, our indexing and search ability is now within range of Elasticsearch. Although PostgreSQL is slower, with [likely] slightly worse results and [possibly] limited capacity – it’s still likely “good enough” at a fairly large scale.
Further Improvements to Full-Text Search
There are still a few optimizations we can make; one in particular is using context to search a smaller data space. There is rarely a case where you have to do a full-text search over everything. Instead, if you already know the type or context of the searches, remove unnecessary words or search a subset of the data, as sketched below.
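One hypothetical way to do this in PostgreSQL is a partial index, so only the relevant subset is ever indexed and scanned:

CREATE INDEX idx_comments_tsv_2018 ON comments
    USING GIN (tsv_comment_text)
    WHERE created_at >= '2018-01-01';  -- index only the subset we query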
For instance, at Metacortex we have a unique way of doing topic modeling that enables us to obtain improved results. A typical query takes around 30ms – 200ms, using the exact same methods described here on a much larger dataset. That’s simply because we search a much smaller data space than in the examples above; although our method is technically not full-text search.
If you’re interested in learning more about Metacortex (my company), PostgreSQL or really anything – feel free to reach out.
Pretty cool way to save the tsvector for quick matching! It reminds me of an optimization we added to AdRoll/batchiepatchie to use GIN trigram indexes to speed up substring matching. It performs well on our jobs table of ~7 million rows, with trigram indexes on 6 columns. The migration is here:
https://github.com/AdRoll/batchiepatchie/blob/master/migrations/00015_pg_trgm_gin_indexes.sql
Yes, PostgreSQL built-in FTS is really great, except when you want to rank the FTS results according to their relevance. Then it is significantly slower than ES. In such a case, look at https://github.com/postgrespro/rum.
What about generated columns?
ALTER TABLE pg_search_documents
ADD COLUMN tsvector_content_dmetaphone tsvector GENERATED ALWAYS AS (
    to_tsvector('simple', pg_search_dmetaphone(coalesce("pg_search_documents"."content"::text, '')))
) STORED;