docs: migrate documents from Notion

2026-05-21 01:53:46 -05:00
parent c7b623012c
commit 0295b758fd
39 changed files with 3292 additions and 4 deletions
@@ -0,0 +1,212 @@
+---
+title: Accelerating Fuzzy Search in PostgreSQL with Tokenisation
+tags:
+  - postgresql
+  - zhparser
+  - full-text-search
+  - performance
+author:
+  name: Siu Jam Oh
+  email: jamo.siu@gmail.com
+---
+
+## Background and Challenges
+
+As our business data surpassed **2 million rows**, traditional `LIKE '%keyword%'` fuzzy queries triggered frequent database I/O alerts, with query response times degrading from milliseconds to seconds. To improve search efficiency and support Chinese semantics, we decided to introduce the `zhparser` extension for full-text search.
+
+## Evolution Path and Environment Adaptation
+
+This implementation went through four key phases, each addressing distinct technical challenges:
+
+### CentOS 7.9 VM (Feasibility Validation)
+
+- **Goal**: Validate the compatibility of `SCWS` + `zhparser` on older systems.
+- **Core Action**: Manually compiled `postgresql-16.2` from source in a CentOS 7.9 environment, got the extension working end-to-end.
+
+**Conclusion**: Confirmed significant performance improvements from the tokenisation approach for Chinese search.
+
+### Local Docker Container (Containerisation Exploration)
+
+- **Goal**: Initial testing in a complete local system.
+- **Core Action**: Injected binary `.so` files via `docker cp`, resolved `ldconfig` dynamic library path visibility issues.
+- **Discovery**: Identified that **missing dictionary files** cause tokenisation to degrade into single-character (particle-level) tokenisation — a critical failure point.
+
+### EulerOS 2.0 Test Server (Self-Compiled Environment Adaptation)
+
+- **Goal**: Adapt to the production OS architecture (x86_64) and self-compiled PostgreSQL installation.
+- **Core Issue**: Resolved `libscws.so.1` loading errors.
+- **Key Solutions**:
+  - Ensured the `postgres` runtime user has access permissions to `/usr/local/scws/lib`.
+  - Modified `systemd` service environment variables or created `/usr/lib64` symlinks to force refresh library search paths.
+
+### Production Deployment Preparation (Final Tuning)
+
+- **Goal**: Ensure query stability at 2M+ data volume.
+- **Optimisation**: Addressed cases where non-semantic fragments (e.g., "古唐合") returned no results by establishing a "full-text search first + `pg_trgm` index assist" degraded query strategy.
+
+## Core Installation and Configuration Steps (Self-Compiled Environments)
+
+### Installing the `SCWS` Tokenisation Engine
+
+SCWS is the underlying core dependency of `zhparser` and must be installed first.
+
+1. **Download and Extract**: Download the source package (e.g., `scws-1.2.3`).
+2. **Compile and Install**:
+
+   ```bash
+   ./configure --prefix=/usr/local/scws
+   make && make install
+   ```
+
+3. **Verify the Library**: Ensure `/usr/local/scws/lib/libscws.so.1` exists.
+
+### Compiling and Installing `zhparser`
+
+This step requires `pg_config` from the self-compiled PostgreSQL installation.
+
+1. **Get the Source**: Clone the `zhparser` project from GitHub.
+2. **Compile with Specified Path**:
+
+   ```bash
+   # Ensure pg_config is in PATH, or specify manually
+   make USE_PGXS=1 PG_CONFIG=/usr/local/pgsql/bin/pg_config
+   make USE_PGXS=1 PG_CONFIG=/usr/local/pgsql/bin/pg_config install
+   ```
+
+   *Note: The `install` step automatically places `zhparser.so` into PG's `pkglibdir` and extension scripts into the `extension` directory.*
+
+### Resolving Dynamic Library Dependencies
+
+1. **Refresh System Cache**:
+
+   ```bash
+   echo "/usr/local/scws/lib" > /etc/ld.so.conf.d/scws.conf
+   ldconfig
+   ```
+
+2. **Permission Check**: Ensure the OS user running `postgres` has `rx` permission on `/usr/local/scws/lib`.
+3. **Force Symlink (Alternative)**: If `ldconfig` fails, symlink the library file to `/usr/lib64`.
+
+### Restarting the Database
+
+After modifying system shared library configuration, the PostgreSQL process must be restarted to reload environment variables and linked libraries.
+
+```bash
+## Restart using pg_ctl (paths may vary for self-compiled installations)
+/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data restart
+
+## Or restart via systemd (if registered as a service)
+systemctl restart postgresql
+```
+
+### Database-Level Initialisation
+
+Connect to `psql` and run the logical configuration:
+
+```sql
+-- Create the extension
+CREATE EXTENSION zhparser;
+
+-- Create a full-text search configuration and bind the tokeniser
+CREATE TEXT SEARCH CONFIGURATION chinese (PARSER = zhparser);
+
+-- Add token mappings
+ALTER TEXT SEARCH CONFIGURATION chinese ADD MAPPING FOR n,v,a,i,e,l,t,b WITH simple;
+
+-- [Optional] Specify the dictionary path for self-compiled installations
+-- ALTER DATABASE postgres SET zhparser.dict_path = '/usr/local/scws/etc/dict.utf8.xdb';
+```
+
+## Performance Analysis and Pitfalls
+
+### Index Performance Bottleneck Analysis
+
+During testing, it was found that even exact queries suffered from `Bitmap Index Scan` due to an improperly designed composite index (with `create_time` as the leading column), resulting in query times as high as **482 ms**.
+
+- **Improvement**: Created single-column **B-tree** indexes on frequently searched columns, reducing response time to under **10 ms** with `Index Scan`.
+
+### GIN Index and Non-Semantic Matching
+
+- **Cross-Word Truncation**: The tokenisation engine is semantic-based, so truncated strings like "古唐合" may fail to match with `@@` due to tokenisation boundaries.
+- **Mitigation Strategy**: Adopt a "waterfall search" approach. Full-text search (FTS) first; if the result set is empty, automatically degrade to `LIKE` fuzzy queries, assisted by `pg_trgm` indexing.
+
+## Final Deployment Strategy: Dual-Track Parallel Retrieval
+
+After analysing the data, we found that approximately 1.24% of account names contain non-standard Simplified Chinese characters, and some company names in the database are unusual enough to cause search failures. We adopted a "stepwise degradation" strategy:
+
+- **Step 1: Full-Text Search (Fast Track)**: Use GIN index for `@@` matching.
+- **Step 2: Result Evaluation**: If the result set is empty, check whether the search term contains letters or suspected Traditional Chinese characters.
+- **Step 3: Fuzzy Fallback (Safe Fallback)**: Execute `LIKE '%keyword%'`. Although slower, since this serves as a "gap-fill" logic triggered only ~1% of the time, it won't impose overall system pressure.
+
+## Search Optimisation: Integrating a Custom Business Lexicon
+
+To address issues like company brand names being incorrectly segmented by full-text search (e.g., "元一" being split into a numeral and a quantifier), we built an automated maintenance pipeline from data extraction to index rebuild.
+
+### Lexicon Extraction and Preprocessing
+
+Leverage the structural parsing capabilities of **`companynameparser`** to strip region names and industry suffixes, and use **`jieba`** for semantic validation to ensure core brand name integrity:
+
+- **Extraction Logic**: Traverse all `buyer_unique_name` values via a Python script, extracting the core `brand` field.
+- **Weight Compensation**: For words prone to fragmentation (e.g., those containing "元", "一", "三"), manually boost TF (term frequency weight) to **50.0–60.0** to ensure their priority overrides built-in quantifier rules.
+- **Output Specification**: Produce SCWS-compliant 4-field `UTF-8` text (WORD, TF, IDF, ATTR). Use tab `\t` separators to avoid parsing anomalies.
+
+### Lexicon Compilation and Deployment
+
+:::tip
+Users compiling `xdb` binary dictionary files on Windows can visit OnixByte’s [GitHub](https://github.com/onixbyte/scws/releases/tag/1.2.3) or [GitLab](https://git.onixbyte.cn/onixbyte/scws/-/releases/1.2.3) pages to download the native scws command-line tool for Windows, pre-compiled using MingW.
+:::
+
+Convert the text dictionary to SCWS's efficient binary format (XDB):
+
+1. **Compile the Binary Dictionary**:
+
+   ```bash
+   # Use scws-gen-dict to generate an encrypted binary lexicon
+   /usr/local/scws/bin/scws-gen-dict -i custom_company.txt -o /usr/local/scws/etc/custom_company.xdb -c utf8
+   ```
+
+2. **File Distribution and Permissions**: Move the generated `.xdb` file to the tokenisation data directory and ensure the `postgres` user has read permission:
+
+   ```bash
+   cp custom_company.xdb /usr/local/pgsql/share/tsearch_data/
+   chown postgres:postgres /usr/local/pgsql/share/tsearch_data/custom_company.xdb
+   ```
+
+### Database Parameter Configuration
+
+Modify `postgresql.conf` to force-load `zhparser` and its custom extension lexicon:
+
+```plain text
+## Preload the extension library (requires restart to take effect)
+shared_preload_libraries = 'zhparser'
+
+## Load custom external dictionaries (use paths relative to tsearch_data)
+zhparser.extra_dicts = 'custom_company.xdb'
+```
+
+### Hot Index Rebuild and Verification
+
+Since tokenisation rules have changed, existing data must be semantically synchronised via index rebuild:
+
+**Physically Restart the Service**:
+
+```bash
+su - postgres -c "/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data restart"
+```
+
+**Online Index Rebuild**: Use the `CONCURRENTLY` keyword to refresh the GIN index without blocking DML operations on 400K rows:
+
+```bash
+REINDEX INDEX CONCURRENTLY index_name;
+```
+
+**Tokenisation Effectiveness Verification**:
+
+```sql
+-- Expected part-of-speech should show as n (noun), not x (unknown)
+SELECT * FROM ts_debug('chinese', '元一能源');
+```
+
+**Optimisation Notes:**
+- **Explicit Weight Compensation**: This is the key technique that resolved the "元一" tokenisation failure (shown as `x`).
+- **Distinguish Restart from Reload**: `shared_preload_libraries` must be activated via `restart`, not a simple reload.