docs: migrate documents from Notion

This commit is contained in:
siujamo
2026-05-21 01:53:46 -05:00
parent c7b623012c
commit 0295b758fd
39 changed files with 3292 additions and 4 deletions
@@ -0,0 +1,212 @@
---
title: Accelerating Fuzzy Search in PostgreSQL with Tokenisation
tags:
- postgresql
- zhparser
- full-text-search
- performance
author:
name: Siu Jam Oh
email: jamo.siu@gmail.com
---
## Background and Challenges
As our business data surpassed **2 million rows**, traditional `LIKE '%keyword%'` fuzzy queries triggered frequent database I/O alerts, with query response times degrading from milliseconds to seconds. To improve search efficiency and support Chinese semantics, we decided to introduce the `zhparser` extension for full-text search.
## Evolution Path and Environment Adaptation
This implementation went through four key phases, each addressing distinct technical challenges:
### CentOS 7.9 VM (Feasibility Validation)
- **Goal**: Validate the compatibility of `SCWS` + `zhparser` on older systems.
- **Core Action**: Manually compiled `postgresql-16.2` from source in a CentOS 7.9 environment, got the extension working end-to-end.
**Conclusion**: Confirmed significant performance improvements from the tokenisation approach for Chinese search.
### Local Docker Container (Containerisation Exploration)
- **Goal**: Initial testing in a complete local system.
- **Core Action**: Injected binary `.so` files via `docker cp`, resolved `ldconfig` dynamic library path visibility issues.
- **Discovery**: Identified that **missing dictionary files** cause tokenisation to degrade into single-character (particle-level) tokenisation — a critical failure point.
### EulerOS 2.0 Test Server (Self-Compiled Environment Adaptation)
- **Goal**: Adapt to the production OS architecture (x86_64) and self-compiled PostgreSQL installation.
- **Core Issue**: Resolved `libscws.so.1` loading errors.
- **Key Solutions**:
- Ensured the `postgres` runtime user has access permissions to `/usr/local/scws/lib`.
- Modified `systemd` service environment variables or created `/usr/lib64` symlinks to force refresh library search paths.
### Production Deployment Preparation (Final Tuning)
- **Goal**: Ensure query stability at 2M+ data volume.
- **Optimisation**: Addressed cases where non-semantic fragments (e.g., "古唐合") returned no results by establishing a "full-text search first + `pg_trgm` index assist" degraded query strategy.
## Core Installation and Configuration Steps (Self-Compiled Environments)
### Installing the `SCWS` Tokenisation Engine
SCWS is the underlying core dependency of `zhparser` and must be installed first.
1. **Download and Extract**: Download the source package (e.g., `scws-1.2.3`).
2. **Compile and Install**:
```bash
./configure --prefix=/usr/local/scws
make && make install
```
3. **Verify the Library**: Ensure `/usr/local/scws/lib/libscws.so.1` exists.
### Compiling and Installing `zhparser`
This step requires `pg_config` from the self-compiled PostgreSQL installation.
1. **Get the Source**: Clone the `zhparser` project from GitHub.
2. **Compile with Specified Path**:
```bash
# Ensure pg_config is in PATH, or specify manually
make USE_PGXS=1 PG_CONFIG=/usr/local/pgsql/bin/pg_config
make USE_PGXS=1 PG_CONFIG=/usr/local/pgsql/bin/pg_config install
```
*Note: The `install` step automatically places `zhparser.so` into PG's `pkglibdir` and extension scripts into the `extension` directory.*
### Resolving Dynamic Library Dependencies
1. **Refresh System Cache**:
```bash
echo "/usr/local/scws/lib" > /etc/ld.so.conf.d/scws.conf
ldconfig
```
2. **Permission Check**: Ensure the OS user running `postgres` has `rx` permission on `/usr/local/scws/lib`.
3. **Force Symlink (Alternative)**: If `ldconfig` fails, symlink the library file to `/usr/lib64`.
### Restarting the Database
After modifying system shared library configuration, the PostgreSQL process must be restarted to reload environment variables and linked libraries.
```bash
## Restart using pg_ctl (paths may vary for self-compiled installations)
/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data restart
## Or restart via systemd (if registered as a service)
systemctl restart postgresql
```
### Database-Level Initialisation
Connect to `psql` and run the logical configuration:
```sql
-- Create the extension
CREATE EXTENSION zhparser;
-- Create a full-text search configuration and bind the tokeniser
CREATE TEXT SEARCH CONFIGURATION chinese (PARSER = zhparser);
-- Add token mappings
ALTER TEXT SEARCH CONFIGURATION chinese ADD MAPPING FOR n,v,a,i,e,l,t,b WITH simple;
-- [Optional] Specify the dictionary path for self-compiled installations
-- ALTER DATABASE postgres SET zhparser.dict_path = '/usr/local/scws/etc/dict.utf8.xdb';
```
## Performance Analysis and Pitfalls
### Index Performance Bottleneck Analysis
During testing, it was found that even exact queries suffered from `Bitmap Index Scan` due to an improperly designed composite index (with `create_time` as the leading column), resulting in query times as high as **482 ms**.
- **Improvement**: Created single-column **B-tree** indexes on frequently searched columns, reducing response time to under **10 ms** with `Index Scan`.
### GIN Index and Non-Semantic Matching
- **Cross-Word Truncation**: The tokenisation engine is semantic-based, so truncated strings like "古唐合" may fail to match with `@@` due to tokenisation boundaries.
- **Mitigation Strategy**: Adopt a "waterfall search" approach. Full-text search (FTS) first; if the result set is empty, automatically degrade to `LIKE` fuzzy queries, assisted by `pg_trgm` indexing.
## Final Deployment Strategy: Dual-Track Parallel Retrieval
After analysing the data, we found that approximately 1.24% of account names contain non-standard Simplified Chinese characters, and some company names in the database are unusual enough to cause search failures. We adopted a "stepwise degradation" strategy:
- **Step 1: Full-Text Search (Fast Track)**: Use GIN index for `@@` matching.
- **Step 2: Result Evaluation**: If the result set is empty, check whether the search term contains letters or suspected Traditional Chinese characters.
- **Step 3: Fuzzy Fallback (Safe Fallback)**: Execute `LIKE '%keyword%'`. Although slower, since this serves as a "gap-fill" logic triggered only ~1% of the time, it won't impose overall system pressure.
## Search Optimisation: Integrating a Custom Business Lexicon
To address issues like company brand names being incorrectly segmented by full-text search (e.g., "元一" being split into a numeral and a quantifier), we built an automated maintenance pipeline from data extraction to index rebuild.
### Lexicon Extraction and Preprocessing
Leverage the structural parsing capabilities of **`companynameparser`** to strip region names and industry suffixes, and use **`jieba`** for semantic validation to ensure core brand name integrity:
- **Extraction Logic**: Traverse all `buyer_unique_name` values via a Python script, extracting the core `brand` field.
- **Weight Compensation**: For words prone to fragmentation (e.g., those containing "元", "一", "三"), manually boost TF (term frequency weight) to **50.060.0** to ensure their priority overrides built-in quantifier rules.
- **Output Specification**: Produce SCWS-compliant 4-field `UTF-8` text (WORD, TF, IDF, ATTR). Use tab `\t` separators to avoid parsing anomalies.
### Lexicon Compilation and Deployment
:::tip
Users compiling `xdb` binary dictionary files on Windows can visit OnixBytes [GitHub](https://github.com/onixbyte/scws/releases/tag/1.2.3) or [GitLab](https://git.onixbyte.cn/onixbyte/scws/-/releases/1.2.3) pages to download the native scws command-line tool for Windows, pre-compiled using MingW.
:::
Convert the text dictionary to SCWS's efficient binary format (XDB):
1. **Compile the Binary Dictionary**:
```bash
# Use scws-gen-dict to generate an encrypted binary lexicon
/usr/local/scws/bin/scws-gen-dict -i custom_company.txt -o /usr/local/scws/etc/custom_company.xdb -c utf8
```
2. **File Distribution and Permissions**: Move the generated `.xdb` file to the tokenisation data directory and ensure the `postgres` user has read permission:
```bash
cp custom_company.xdb /usr/local/pgsql/share/tsearch_data/
chown postgres:postgres /usr/local/pgsql/share/tsearch_data/custom_company.xdb
```
### Database Parameter Configuration
Modify `postgresql.conf` to force-load `zhparser` and its custom extension lexicon:
```plain text
## Preload the extension library (requires restart to take effect)
shared_preload_libraries = 'zhparser'
## Load custom external dictionaries (use paths relative to tsearch_data)
zhparser.extra_dicts = 'custom_company.xdb'
```
### Hot Index Rebuild and Verification
Since tokenisation rules have changed, existing data must be semantically synchronised via index rebuild:
**Physically Restart the Service**:
```bash
su - postgres -c "/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data restart"
```
**Online Index Rebuild**: Use the `CONCURRENTLY` keyword to refresh the GIN index without blocking DML operations on 400K rows:
```bash
REINDEX INDEX CONCURRENTLY index_name;
```
**Tokenisation Effectiveness Verification**:
```sql
-- Expected part-of-speech should show as n (noun), not x (unknown)
SELECT * FROM ts_debug('chinese', '元一能源');
```
**Optimisation Notes:**
- **Explicit Weight Compensation**: This is the key technique that resolved the "元一" tokenisation failure (shown as `x`).
- **Distinguish Restart from Reload**: `shared_preload_libraries` must be activated via `restart`, not a simple reload.