---
title: Accelerating Fuzzy Search in PostgreSQL with Tokenisation
tags:
  - postgresql
  - zhparser
  - full-text-search
  - performance
author:
  name: Siu Jam Oh
  email: jamo.siu@gmail.com
---

## Background and Challenges

As our business data surpassed **2 million rows**, traditional `LIKE '%keyword%'` fuzzy queries triggered frequent database I/O alerts, and query response times degraded from milliseconds to seconds. A pattern with a leading wildcard cannot use an ordinary B-tree index, so these queries typically end up scanning the whole table. To improve search efficiency and support Chinese semantics, we introduced the `zhparser` extension for full-text search.

## Evolution Path and Environment Adaptation

The implementation went through four phases, each addressing a distinct technical challenge:

### CentOS 7.9 VM (Feasibility Validation)

- **Goal**: Validate the compatibility of `SCWS` + `zhparser` on older systems.
- **Core Action**: Manually compiled `postgresql-16.2` from source on CentOS 7.9 and got the extension working end-to-end.
- **Conclusion**: Confirmed that the tokenisation approach significantly improves Chinese search performance.

### Local Docker Container (Containerisation Exploration)

- **Goal**: Run initial tests against a complete local environment.
- **Core Action**: Injected the binary `.so` files via `docker cp` and resolved `ldconfig` dynamic library path visibility issues.
- **Discovery**: **Missing dictionary files** cause tokenisation to degrade into single-character tokenisation, which is a critical failure point.

### EulerOS 2.0 Test Server (Self-Compiled Environment Adaptation)

- **Goal**: Adapt to the production OS architecture (x86_64) and the self-compiled PostgreSQL installation.
- **Core Issue**: Resolved `libscws.so.1` loading errors.
- **Key Solutions**:
  - Ensured the OS user running `postgres` has access permissions on `/usr/local/scws/lib`.
  - Set environment variables in the `systemd` service unit, or created symlinks in `/usr/lib64`, so that the library search path is picked up.

### Production Deployment Preparation (Final Tuning)

- **Goal**: Ensure query stability at a data volume of 2M+ rows.
- **Optimisation**: Non-semantic fragments (e.g., "古唐合") returned no results under full-text search, so we established a fallback query strategy: full-text search first, then `LIKE` assisted by a `pg_trgm` index.

## Core Installation and Configuration Steps (Self-Compiled Environments)

### Installing the `SCWS` Tokenisation Engine

SCWS is the core dependency of `zhparser` and must be installed first.

1. **Download and Extract**: Download the source package (e.g., `scws-1.2.3`) and extract it.
2. **Compile and Install** (from the extracted source directory):

   ```bash
   ./configure --prefix=/usr/local/scws
   make && make install
   ```

3. **Verify the Library**: Ensure `/usr/local/scws/lib/libscws.so.1` exists.

### Compiling and Installing `zhparser`

This step requires `pg_config` from the self-compiled PostgreSQL installation.

1. **Get the Source**: Clone the `zhparser` project from GitHub.
2. **Compile with Specified Path**:

   ```bash
   # Ensure pg_config is in PATH, or specify it manually as below.
   # If SCWS lives outside /usr/local, zhparser's Makefile may also need
   # SCWS_HOME=/usr/local/scws in the environment for both commands.
   make USE_PGXS=1 PG_CONFIG=/usr/local/pgsql/bin/pg_config
   make USE_PGXS=1 PG_CONFIG=/usr/local/pgsql/bin/pg_config install
   ```

*Note: The `install` step places `zhparser.so` into PostgreSQL's `pkglibdir` and the extension scripts into the `extension` directory.*

### Resolving Dynamic Library Dependencies

1. **Refresh System Cache** (as root):

   ```bash
   echo "/usr/local/scws/lib" > /etc/ld.so.conf.d/scws.conf
   ldconfig
   ```

2. **Permission Check**: Ensure the OS user running `postgres` has `rx` permission on `/usr/local/scws/lib`.
3. **Force Symlink (Alternative)**: If the `ldconfig` approach does not take effect, symlink the library file into `/usr/lib64`.

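Before restarting the database, it can help to confirm that the dynamic loader actually resolves the library for the account PostgreSQL runs under. The following is a minimal sketch using Python's standard `ctypes` module; the function name `can_load` is hypothetical and only illustrates the check.

```python
import ctypes


def can_load(lib: str = "libscws.so.1") -> bool:
    """Return True if the dynamic loader can open the given shared library.

    A bare file name is resolved through the normal loader search path
    (including the ldconfig cache); an absolute path is opened directly.
    """
    try:
        ctypes.CDLL(lib)
    except OSError:
        # Not found, or found but not readable by the current user.
        return False
    return True


if __name__ == "__main__":
    print("libscws.so.1 loadable:", can_load())
```

Running the script as the same OS user that runs `postgres` exercises both the search path and the permission check from the steps above.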
### Restarting the Database

After modifying the system shared library configuration, restart the PostgreSQL process so that it picks up the new environment variables and library paths.

```bash
# Restart using pg_ctl (paths may vary for self-compiled installations)
/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data restart

# Or restart via systemd (if registered as a service)
systemctl restart postgresql
```

### Database-Level Initialisation

Connect with `psql` and run the following configuration:

```sql
-- Create the extension
CREATE EXTENSION zhparser;

-- Create a full-text search configuration and bind the tokeniser
CREATE TEXT SEARCH CONFIGURATION chinese (PARSER = zhparser);

-- Add token mappings
ALTER TEXT SEARCH CONFIGURATION chinese ADD MAPPING FOR n,v,a,i,e,l,t,b WITH simple;

-- [Optional] Specify the dictionary path for self-compiled installations
-- ALTER DATABASE postgres SET zhparser.dict_path = '/usr/local/scws/etc/dict.utf8.xdb';
```

## Performance Analysis and Pitfalls

### Index Performance Bottleneck Analysis

Testing showed that even exact-match queries ended up on a `Bitmap Index Scan` because of a poorly designed composite index (with `create_time` as the leading column), with query times as high as **482 ms**.

- **Improvement**: Created single-column **B-tree** indexes on the frequently searched columns, which reduced response time to under **10 ms** with an `Index Scan`.

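The effect of the leading column can be illustrated with a toy model. The sketch below is not how PostgreSQL stores indexes. It only mimics sort order with Python lists, and all row values are made up: entries sorted by `(create_time, name)` are not ordered by `name` overall, so a lookup by name alone cannot binary-search them, whereas entries sorted by `name` can.

```python
from bisect import bisect_left, bisect_right

# Made-up rows of (create_time, name); every name is distinct here.
rows = [(day, f"acct-{(day * 7) % 10}") for day in range(10)]

# Toy "composite index": ordered by create_time first, then name.
composite = sorted(rows)

# Toy "single-column index": ordered by name only.
by_name = sorted(rows, key=lambda r: r[1])
names = [r[1] for r in by_name]


def find_by_name_composite(name):
    """Name order is only local to each create_time, so visit every entry."""
    return [r for r in composite if r[1] == name]


def find_by_name_single(name):
    """Entries are ordered by name, so binary search locates the range."""
    lo = bisect_left(names, name)
    hi = bisect_right(names, name)
    return by_name[lo:hi]
```

Both functions return the same rows; the difference is that the first touches every entry while the second touches only a logarithmic number of them.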
### GIN Index and Non-Semantic Matching

- **Cross-Word Truncation**: The tokeniser splits text along semantic word boundaries, so a fragment that straddles those boundaries, such as "古唐合", may fail to match with `@@`.
- **Mitigation Strategy**: Adopt a "waterfall search" approach: run full-text search (FTS) first; if the result set is empty, automatically fall back to a `LIKE` fuzzy query assisted by a `pg_trgm` index.

## Final Deployment Strategy: Dual-Track Parallel Retrieval

After analysing the data, we found that approximately 1.24% of account names contain non-standard Simplified Chinese characters, and some company names in the database are unusual enough to cause search failures. We adopted a "stepwise fallback" strategy:

- **Step 1: Full-Text Search (Fast Track)**: Use the GIN index for `@@` matching.
- **Step 2: Result Evaluation**: If the result set is empty, check whether the search term contains letters or suspected Traditional Chinese characters.
- **Step 3: Fuzzy Fallback (Safe Fallback)**: Execute `LIKE '%keyword%'`. Although slower, this gap-filling path is triggered only about 1% of the time, so it does not put significant pressure on the system overall.

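The three steps can be sketched in application code as follows. This is a minimal illustration under stated assumptions, not the production implementation: the two query functions are passed in as callables (in practice they would wrap the `@@` query and the `LIKE` query), the names `waterfall_search`, `classify_miss` and `escape_like` are hypothetical, and `TRADITIONAL_SAMPLE` is a tiny made-up sample where a real check would need a full character mapping. The sketch also assumes the fallback runs whenever full-text search returns nothing, with Step 2 used only to label the reason.

```python
# Hypothetical sample of Traditional-only characters, for illustration only.
TRADITIONAL_SAMPLE = frozenset("國際實業門華")


def escape_like(term: str) -> str:
    """Escape LIKE wildcards (backslash is PostgreSQL's default escape)."""
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def classify_miss(term: str) -> str:
    """Step 2: label the likely reason full-text search found nothing."""
    if any(ch.isascii() and ch.isalpha() for ch in term):
        return "latin"
    if any(ch in TRADITIONAL_SAMPLE for ch in term):
        return "traditional"
    return "other"


def waterfall_search(term, fts_search, like_search):
    """Run FTS first; fall back to LIKE only when FTS returns no rows.

    fts_search(term) and like_search(pattern) are caller-supplied callables
    that execute the actual queries and return a list of rows.
    """
    rows = fts_search(term)                 # Step 1: GIN-backed @@ match
    if rows:
        return rows, "fts"
    reason = classify_miss(term)            # Step 2: evaluate the miss
    pattern = f"%{escape_like(term)}%"      # Step 3: fuzzy fallback
    return like_search(pattern), f"like:{reason}"
```

Returning the path taken alongside the rows makes it straightforward to log how often the fallback fires and to check that it stays near the expected ~1%.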
## Search Optimisation: Integrating a Custom Business Lexicon

To address company brand names being incorrectly segmented by full-text search (e.g., "元一" being split into a numeral and a quantifier), we built an automated maintenance pipeline that runs from data extraction to index rebuild.

### Lexicon Extraction and Preprocessing

We use the structural parsing capabilities of **`companynameparser`** to strip region names and industry suffixes, and **`jieba`** for semantic validation to keep core brand names intact:

- **Extraction Logic**: A Python script traverses all `buyer_unique_name` values and extracts the core `brand` field.
- **Weight Compensation**: For words prone to fragmentation (e.g., those containing "元", "一", "三"), manually boost TF (term frequency weight) to **50.0–60.0** so that they take priority over the built-in quantifier rules.
- **Output Specification**: Produce SCWS-compliant, 4-field `UTF-8` text (WORD, TF, IDF, ATTR), using tab (`\t`) separators to avoid parsing anomalies.

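The output stage of such a script might look like the sketch below. It covers only the weight compensation and the 4-field tab-separated format; the brand extraction with `companynameparser` and `jieba` is assumed to have happened already. The function names, the default TF and IDF of `1.0`, the attribute `n`, and the choice of `50.0` (the lower end of the range above) are illustrative assumptions.

```python
# Characters that tend to be split off as numerals or quantifiers.
FRAGILE_CHARS = frozenset("元一三")
BOOSTED_TF = 50.0  # lower end of the 50.0-60.0 range; an assumed choice


def scws_line(word: str, tf: float = 1.0, idf: float = 1.0, attr: str = "n") -> str:
    """Format one entry as WORD<TAB>TF<TAB>IDF<TAB>ATTR."""
    if any(ch in FRAGILE_CHARS for ch in word):
        tf = max(tf, BOOSTED_TF)
    return f"{word}\t{tf:.1f}\t{idf:.1f}\t{attr}"


def render_lexicon(brands) -> str:
    """Deduplicate brand names, drop blanks, and render the dictionary text."""
    seen = set()
    lines = []
    for brand in brands:
        word = brand.strip()
        if not word or word in seen:
            continue
        seen.add(word)
        lines.append(scws_line(word))
    return "\n".join(lines) + "\n"


def write_lexicon(brands, path: str) -> None:
    """Write the dictionary as UTF-8 text with Unix line endings."""
    with open(path, "w", encoding="utf-8", newline="\n") as fh:
        fh.write(render_lexicon(brands))
```

For example, `write_lexicon(["元一", "华能"], "custom_company.txt")` would produce the text file consumed by the compilation step.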
### Lexicon Compilation and Deployment

:::tip
Users compiling `xdb` binary dictionary files on Windows can download a native `scws` command-line tool for Windows, pre-compiled with MinGW, from OnixByte's [GitHub](https://github.com/onixbyte/scws/releases/tag/1.2.3) or [GitLab](https://git.onixbyte.cn/onixbyte/scws/-/releases/1.2.3) release pages.
:::

Convert the text dictionary to SCWS's efficient binary format (XDB):

1. **Compile the Binary Dictionary**:

   ```bash
   # Use scws-gen-dict to compile the text lexicon into a binary XDB file
   /usr/local/scws/bin/scws-gen-dict -i custom_company.txt -o /usr/local/scws/etc/custom_company.xdb -c utf8
   ```

2. **File Distribution and Permissions**: Copy the generated `.xdb` file into PostgreSQL's `tsearch_data` directory and ensure the `postgres` user has read permission:

   ```bash
   cp /usr/local/scws/etc/custom_company.xdb /usr/local/pgsql/share/tsearch_data/
   chown postgres:postgres /usr/local/pgsql/share/tsearch_data/custom_company.xdb
   ```

### Database Parameter Configuration

Modify `postgresql.conf` to preload `zhparser` and load the custom lexicon:

```plain text
# Preload the extension library (requires restart to take effect)
shared_preload_libraries = 'zhparser'

# Load custom external dictionaries (use paths relative to tsearch_data)
zhparser.extra_dicts = 'custom_company.xdb'
```

### Hot Index Rebuild and Verification

Since the tokenisation rules have changed, existing data must be re-tokenised by rebuilding the index:

**Restart the Service**:

```bash
su - postgres -c "/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data restart"
```

**Online Index Rebuild**: Use the `CONCURRENTLY` keyword to refresh the GIN index without blocking DML operations on 400K rows. This re-tokenises the data when the GIN index is an expression index over `to_tsvector(...)`; a stored `tsvector` column would need to be recomputed first.

```sql
REINDEX INDEX CONCURRENTLY index_name;
```

**Tokenisation Effectiveness Verification**:

```sql
-- Expected part-of-speech should show as n (noun), not x (unknown)
SELECT * FROM ts_debug('chinese', '元一能源');
```

**Optimisation Notes:**

- **Explicit Weight Compensation**: This is the key technique that resolved the "元一" tokenisation failure (previously shown as `x`).
- **Distinguish Restart from Reload**: Changes to `shared_preload_libraries` only take effect after a `restart`, not after a simple reload.