---
title: Accelerating Fuzzy Search in PostgreSQL with Tokenisation
tags:
- postgresql
- zhparser
- full-text-search
- performance
author:
name: Siu Jam Oh
email: jamo.siu@gmail.com
---
## Background and Challenges
As our business data surpassed **2 million rows**, traditional `LIKE '%keyword%'` fuzzy queries triggered frequent database I/O alerts, with query response times degrading from milliseconds to seconds. To improve search efficiency and support Chinese semantics, we decided to introduce the `zhparser` extension for full-text search.
## Evolution Path and Environment Adaptation
This implementation went through four key phases, each addressing distinct technical challenges:
### CentOS 7.9 VM (Feasibility Validation)
- **Goal**: Validate the compatibility of `SCWS` + `zhparser` on older systems.
- **Core Action**: Manually compiled `postgresql-16.2` from source on CentOS 7.9 and got the extension working end-to-end.
- **Conclusion**: Confirmed that the tokenisation approach significantly improves Chinese search performance.
### Local Docker Container (Containerisation Exploration)
- **Goal**: Run initial tests in a complete local environment.
- **Core Action**: Injected binary `.so` files via `docker cp` and resolved `ldconfig` dynamic library path visibility issues.
- **Discovery**: Identified that **missing dictionary files** cause tokenisation to degrade into single-character (particle-level) tokenisation — a critical failure point.
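
The degradation described above is easy to miss because queries still run; they just match poorly. A minimal sketch of a sanity check follows, assuming the token list comes from something like `SELECT token FROM ts_debug('chinese', ...)`. The function names and the 0.9 threshold are hypothetical choices for illustration, not part of `zhparser`.

```python
def single_char_ratio(tokens):
    """Return the fraction of non-blank tokens that are exactly one character long."""
    tokens = [t for t in tokens if t.strip()]
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if len(t) == 1) / len(tokens)


def looks_degraded(tokens, threshold=0.9):
    """Heuristic: flag output where nearly every token is a single character.

    The threshold is an assumed value; tune it against text that is known
    to segment correctly when the dictionary is present.
    """
    return len(tokens) > 1 and single_char_ratio(tokens) >= threshold
```

Running this against a handful of known multi-character words after each deployment gives an early warning that the dictionary file was not found.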
### EulerOS 2.0 Test Server (Self-Compiled Environment Adaptation)
- **Goal**: Adapt to the production OS architecture (x86_64) and self-compiled PostgreSQL installation.
- **Core Issue**: Resolved `libscws.so.1` loading errors.
- **Key Solutions**:
  - Ensured the `postgres` runtime user has access permissions to `/usr/local/scws/lib`.
  - Modified the `systemd` service environment variables or created symlinks in `/usr/lib64` to force the library search path to refresh.
### Production Deployment Preparation (Final Tuning)
- **Goal**: Ensure query stability at 2M+ data volume.
- **Optimisation**: Addressed cases where non-semantic fragments (e.g., "古唐合") returned no results by establishing a "full-text search first + `pg_trgm` index assist" degraded query strategy.
## Core Installation and Configuration Steps (Self-Compiled Environments)
### Installing the `SCWS` Tokenisation Engine
SCWS is the underlying core dependency of `zhparser` and must be installed first.
1. **Download and Extract**: Download the source package (e.g., `scws-1.2.3`).
2. **Compile and Install**:
```bash
./configure --prefix=/usr/local/scws
make && make install
```
3. **Verify the Library**: Ensure `/usr/local/scws/lib/libscws.so.1` exists.
### Compiling and Installing `zhparser`
This step requires `pg_config` from the self-compiled PostgreSQL installation.
1. **Get the Source**: Clone the `zhparser` project from GitHub.
2. **Compile with Specified Path**:
```bash
# Ensure pg_config is in PATH, or specify manually
make USE_PGXS=1 PG_CONFIG=/usr/local/pgsql/bin/pg_config
make USE_PGXS=1 PG_CONFIG=/usr/local/pgsql/bin/pg_config install
```
*Note: The `install` step automatically places `zhparser.so` into PG's `pkglibdir` and extension scripts into the `extension` directory.*
### Resolving Dynamic Library Dependencies
1. **Refresh System Cache**:
```bash
echo "/usr/local/scws/lib" > /etc/ld.so.conf.d/scws.conf
ldconfig
```
2. **Permission Check**: Ensure the OS user running `postgres` has `rx` permission on `/usr/local/scws/lib`.
3. **Force Symlink (Alternative)**: If `ldconfig` fails, symlink the library file to `/usr/lib64`.
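
To confirm that the steps above actually resolved the dependency, one option is to inspect the output of `ldd` for the extension library. The sketch below is illustrative only: the helper names are hypothetical, and the path to `zhparser.so` depends on your installation (`pg_config --pkglibdir` prints the directory).

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the names of shared libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            # Lines look like: "\tlibscws.so.1 => not found"
            missing.append(line.split("=>")[0].strip())
    return missing


def check_shared_object(path):
    """Run ldd on a shared object and return its unresolved dependencies."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=False
    )
    return missing_libraries(result.stdout)
```

Calling `check_shared_object` on the installed `zhparser.so` as the `postgres` OS user should return an empty list once `libscws.so.1` is visible to the dynamic linker.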
### Restarting the Database
After modifying system shared library configuration, the PostgreSQL process must be restarted to reload environment variables and linked libraries.
```bash
# Restart using pg_ctl (paths may vary for self-compiled installations)
/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data restart
# Or restart via systemd (if registered as a service)
systemctl restart postgresql
```
### Database-Level Initialisation
Connect to `psql` and run the logical configuration:
```sql
-- Create the extension
CREATE EXTENSION zhparser;
-- Create a full-text search configuration and bind the tokeniser
CREATE TEXT SEARCH CONFIGURATION chinese (PARSER = zhparser);
-- Add token mappings
ALTER TEXT SEARCH CONFIGURATION chinese ADD MAPPING FOR n,v,a,i,e,l,t,b WITH simple;
-- [Optional] Specify the dictionary path for self-compiled installations
-- ALTER DATABASE postgres SET zhparser.dict_path = '/usr/local/scws/etc/dict.utf8.xdb';
```
## Performance Analysis and Pitfalls
### Index Performance Bottleneck Analysis
During testing, we found that even exact-match queries fell back to a `Bitmap Index Scan` because of an improperly designed composite index (with `create_time` as the leading column), pushing query times as high as **482 ms**.
- **Improvement**: Created single-column **B-tree** indexes on frequently searched columns, reducing response time to under **10 ms** with `Index Scan`.
### GIN Index and Non-Semantic Matching
- **Cross-Word Truncation**: The tokenisation engine is semantic-based, so truncated strings like "古唐合" may fail to match with `@@` due to tokenisation boundaries.
- **Mitigation Strategy**: Adopt a "waterfall search" approach. Full-text search (FTS) first; if the result set is empty, automatically degrade to `LIKE` fuzzy queries, assisted by `pg_trgm` indexing.
## Final Deployment Strategy: Dual-Track Parallel Retrieval
After analysing the data, we found that approximately 1.24% of account names contain non-standard Simplified Chinese characters, and some company names in the database are unusual enough to cause search failures. We adopted a "stepwise degradation" strategy:
- **Step 1: Full-Text Search (Fast Track)**: Use GIN index for `@@` matching.
- **Step 2: Result Evaluation**: If the result set is empty, check whether the search term contains letters or suspected Traditional Chinese characters.
- **Step 3: Fuzzy Fallback (Safe Fallback)**: Execute `LIKE '%keyword%'`. This is slower, but because it is gap-filling logic triggered only ~1% of the time, it does not add significant load to the system overall.
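
The stepwise degradation can be sketched in application code roughly as follows. This is a minimal sketch under stated assumptions, not our production implementation: the table `accounts` and its columns `id` and `name` are hypothetical, the cursor is any DB-API cursor that accepts `%s` placeholders (as psycopg does), and the Step 2 check is omitted so the fallback runs whenever full-text search returns nothing.

```python
# Hypothetical table and column names; adjust to your schema.
FTS_SQL = (
    "SELECT id, name FROM accounts "
    "WHERE to_tsvector('chinese', name) @@ plainto_tsquery('chinese', %s) "
    "LIMIT %s"
)
LIKE_SQL = "SELECT id, name FROM accounts WHERE name LIKE %s LIMIT %s"


def escape_like(term):
    """Escape LIKE wildcards so the term is matched literally."""
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def waterfall_search(cursor, term, limit=50):
    """Try full-text search first; fall back to LIKE only when it finds nothing.

    Returns (rows, track), where track is "fts" or "like".
    """
    cursor.execute(FTS_SQL, (term, limit))
    rows = cursor.fetchall()
    if rows:
        return rows, "fts"
    cursor.execute(LIKE_SQL, ("%" + escape_like(term) + "%", limit))
    return cursor.fetchall(), "like"
```

Returning the track alongside the rows makes it easy to log how often the fallback fires. Note that the `to_tsvector` expression in the query has to match the expression used in the GIN index definition for the planner to use that index.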
## Search Optimisation: Integrating a Custom Business Lexicon
To address issues like company brand names being incorrectly segmented by full-text search (e.g., "元一" being split into a numeral and a quantifier), we built an automated maintenance pipeline from data extraction to index rebuild.
### Lexicon Extraction and Preprocessing
Leverage the structural parsing capabilities of **`companynameparser`** to strip region names and industry suffixes, and use **`jieba`** for semantic validation to ensure core brand name integrity:
- **Extraction Logic**: Traverse all `buyer_unique_name` values via a Python script, extracting the core `brand` field.
- **Weight Compensation**: For words prone to fragmentation (e.g., those containing "元", "一", "三"), manually boost TF (term frequency weight) to between **50.0** and **60.0** to ensure their priority overrides built-in quantifier rules.
- **Output Specification**: Produce SCWS-compliant 4-field `UTF-8` text (WORD, TF, IDF, ATTR). Use tab `\t` separators to avoid parsing anomalies.
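
As a rough illustration of the output step, the sketch below formats brand names into the 4-field, tab-separated layout. The helper names, the default TF and IDF of 1.0, the boosted TF of 50.0, and the `n` attribute are assumptions chosen for the example; the `companynameparser` and `jieba` extraction stages are not shown.

```python
# Characters that make a brand name prone to being split (from the examples above).
FRAGILE_CHARS = set("元一三")


def lexicon_line(word, tf=1.0, idf=1.0, attr="n", boosted_tf=50.0):
    """Format one text-dictionary entry as WORD<TAB>TF<TAB>IDF<TAB>ATTR.

    Words containing a fragile character get their TF raised to at least
    boosted_tf (an assumed value for illustration).
    """
    if FRAGILE_CHARS & set(word):
        tf = max(tf, boosted_tf)
    return f"{word}\t{tf:.1f}\t{idf:.1f}\t{attr}"


def write_lexicon(brands, path):
    """Write unique brand names to a UTF-8 text dictionary, one entry per line."""
    with open(path, "w", encoding="utf-8", newline="\n") as fh:
        for word in sorted(set(brands)):
            fh.write(lexicon_line(word) + "\n")
```

For example, `lexicon_line("元一")` produces a line with TF 50.0, while a name without any fragile character keeps the default weight.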
### Lexicon Compilation and Deployment
:::tip
Users compiling `xdb` binary dictionary files on Windows can visit OnixByte's [GitHub](https://github.com/onixbyte/scws/releases/tag/1.2.3) or [GitLab](https://git.onixbyte.cn/onixbyte/scws/-/releases/1.2.3) release pages to download a native scws command-line tool for Windows, pre-compiled with MinGW.
:::
Convert the text dictionary to SCWS's efficient binary format (XDB):
1. **Compile the Binary Dictionary**:
```bash
# Use scws-gen-dict to compile the text lexicon into a binary XDB file
/usr/local/scws/bin/scws-gen-dict -i custom_company.txt -o /usr/local/scws/etc/custom_company.xdb -c utf8
```
2. **File Distribution and Permissions**: Move the generated `.xdb` file to the tokenisation data directory and ensure the `postgres` user has read permission:
```bash
cp /usr/local/scws/etc/custom_company.xdb /usr/local/pgsql/share/tsearch_data/
chown postgres:postgres /usr/local/pgsql/share/tsearch_data/custom_company.xdb
```
### Database Parameter Configuration
Modify `postgresql.conf` to force-load `zhparser` and its custom extension lexicon:
```ini
# Preload the extension library (requires restart to take effect)
shared_preload_libraries = 'zhparser'
# Load custom external dictionaries (use paths relative to tsearch_data)
zhparser.extra_dicts = 'custom_company.xdb'
```
### Hot Index Rebuild and Verification
Because the tokenisation rules have changed, indexes built on existing data must be rebuilt so that their tokens match the new segmentation:
**Physically Restart the Service**:
```bash
su - postgres -c "/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data restart"
```
**Online Index Rebuild**: Use the `CONCURRENTLY` keyword to refresh the GIN index without blocking DML operations on 400K rows:
```sql
REINDEX INDEX CONCURRENTLY index_name;
```
**Tokenisation Effectiveness Verification**:
```sql
-- Expected part-of-speech should show as n (noun), not x (unknown)
SELECT * FROM ts_debug('chinese', '元一能源');
```
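
The same check can be automated after each lexicon update. The following is a small sketch, assuming a DB-API cursor with `%s` placeholders (as in psycopg); the function name is hypothetical.

```python
def unknown_tokens(cursor, text, config="chinese"):
    """Return the tokens that ts_debug labels with alias 'x' for the given text."""
    cursor.execute(
        "SELECT alias, token FROM ts_debug(%s::regconfig, %s)",
        (config, text),
    )
    return [token for alias, token in cursor.fetchall() if alias == "x"]
```

An empty list for a sample of brand names such as '元一能源' suggests the custom lexicon has been loaded; any returned token is a candidate for the weight compensation described earlier.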
**Optimisation Notes:**
- **Explicit Weight Compensation**: This is the key technique that resolved the "元一" tokenisation failure (shown as `x`).
- **Distinguish Restart from Reload**: `shared_preload_libraries` must be activated via `restart`, not a simple reload.