homepage/docs/en-gb/blog/contents/postgresql-zhparser-fuzzy-search.md

---
title: Accelerating Fuzzy Search in PostgreSQL with Tokenisation
tags:
  - postgresql
  - zhparser
  - full-text-search
  - performance
author:
  name: Siu Jam Oh
  email: jamo.siu@gmail.com
---

## Background and Challenges

As our business data surpassed **2 million rows**, traditional `LIKE '%keyword%'` fuzzy queries triggered frequent database I/O alerts, with query response times degrading from milliseconds to seconds. To improve search efficiency and support Chinese semantics, we decided to introduce the `zhparser` extension for full-text search.

## Evolution Path and Environment Adaptation

This implementation went through four key phases, each addressing distinct technical challenges:

### CentOS 7.9 VM (Feasibility Validation)

- **Goal**: Validate the compatibility of `SCWS` + `zhparser` on older systems.
- **Core Action**: Manually compiled `postgresql-16.2` from source in a CentOS 7.9 environment, got the extension working end-to-end.

**Conclusion**: Confirmed significant performance improvements from the tokenisation approach for Chinese search.

### Local Docker Container (Containerisation Exploration)

- **Goal**: Initial testing in a complete local system.
- **Core Action**: Injected binary `.so` files via `docker cp`, resolved `ldconfig` dynamic library path visibility issues.
- **Discovery**: Identified that **missing dictionary files** cause tokenisation to degrade into single-character (particle-level) tokenisation — a critical failure point.

### EulerOS 2.0 Test Server (Self-Compiled Environment Adaptation)

- **Goal**: Adapt to the production OS architecture (x86_64) and self-compiled PostgreSQL installation.
- **Core Issue**: Resolved `libscws.so.1` loading errors.
- **Key Solutions**:
  - Ensured the `postgres` runtime user has access permissions to `/usr/local/scws/lib`.
  - Modified `systemd` service environment variables or created `/usr/lib64` symlinks to force refresh library search paths.

### Production Deployment Preparation (Final Tuning)

- **Goal**: Ensure query stability at 2M+ data volume.
- **Optimisation**: Addressed cases where non-semantic fragments (e.g., "古唐合") returned no results by establishing a "full-text search first + `pg_trgm` index assist" degraded query strategy.

## Core Installation and Configuration Steps (Self-Compiled Environments)

### Installing the `SCWS` Tokenisation Engine

SCWS is the underlying core dependency of `zhparser` and must be installed first.

1. **Download and Extract**: Download the source package (e.g., `scws-1.2.3`).
2. **Compile and Install**:

   ```bash
   ./configure --prefix=/usr/local/scws
   make && make install
   ```

3. **Verify the Library**: Ensure `/usr/local/scws/lib/libscws.so.1` exists.

### Compiling and Installing `zhparser`

This step requires `pg_config` from the self-compiled PostgreSQL installation.

1. **Get the Source**: Clone the `zhparser` project from GitHub.
2. **Compile with Specified Path**:

   ```bash
   # Ensure pg_config is in PATH, or specify manually
   make USE_PGXS=1 PG_CONFIG=/usr/local/pgsql/bin/pg_config
   make USE_PGXS=1 PG_CONFIG=/usr/local/pgsql/bin/pg_config install
   ```

   *Note: The `install` step automatically places `zhparser.so` into PG's `pkglibdir` and extension scripts into the `extension` directory.*

### Resolving Dynamic Library Dependencies

1. **Refresh System Cache**:

   ```bash
   echo "/usr/local/scws/lib" > /etc/ld.so.conf.d/scws.conf
   ldconfig
   ```

2. **Permission Check**: Ensure the OS user running `postgres` has `rx` permission on `/usr/local/scws/lib`.
3. **Force Symlink (Alternative)**: If `ldconfig` fails, symlink the library file to `/usr/lib64`.

### Restarting the Database

After modifying system shared library configuration, the PostgreSQL process must be restarted to reload environment variables and linked libraries.

```bash
## Restart using pg_ctl (paths may vary for self-compiled installations)
/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data restart

## Or restart via systemd (if registered as a service)
systemctl restart postgresql
```

### Database-Level Initialisation

Connect to `psql` and run the logical configuration:

```sql
-- Create the extension
CREATE EXTENSION zhparser;

-- Create a full-text search configuration and bind the tokeniser
CREATE TEXT SEARCH CONFIGURATION chinese (PARSER = zhparser);

-- Add token mappings
ALTER TEXT SEARCH CONFIGURATION chinese ADD MAPPING FOR n,v,a,i,e,l,t,b WITH simple;

-- [Optional] Specify the dictionary path for self-compiled installations
-- ALTER DATABASE postgres SET zhparser.dict_path = '/usr/local/scws/etc/dict.utf8.xdb';
```

## Performance Analysis and Pitfalls

### Index Performance Bottleneck Analysis

During testing, it was found that even exact queries suffered from `Bitmap Index Scan` due to an improperly designed composite index (with `create_time` as the leading column), resulting in query times as high as **482 ms**.

- **Improvement**: Created single-column **B-tree** indexes on frequently searched columns, reducing response time to under **10 ms** with `Index Scan`.

### GIN Index and Non-Semantic Matching

- **Cross-Word Truncation**: The tokenisation engine is semantic-based, so truncated strings like "古唐合" may fail to match with `@@` due to tokenisation boundaries.
- **Mitigation Strategy**: Adopt a "waterfall search" approach. Full-text search (FTS) first; if the result set is empty, automatically degrade to `LIKE` fuzzy queries, assisted by `pg_trgm` indexing.

## Final Deployment Strategy: Dual-Track Parallel Retrieval

After analysing the data, we found that approximately 1.24% of account names contain non-standard Simplified Chinese characters, and some company names in the database are unusual enough to cause search failures. We adopted a "stepwise degradation" strategy:

- **Step 1: Full-Text Search (Fast Track)**: Use GIN index for `@@` matching.
- **Step 2: Result Evaluation**: If the result set is empty, check whether the search term contains letters or suspected Traditional Chinese characters.
- **Step 3: Fuzzy Fallback (Safe Fallback)**: Execute `LIKE '%keyword%'`. Although slower, since this serves as a "gap-fill" logic triggered only ~1% of the time, it won't impose overall system pressure.

## Search Optimisation: Integrating a Custom Business Lexicon

To address issues like company brand names being incorrectly segmented by full-text search (e.g., "元一" being split into a numeral and a quantifier), we built an automated maintenance pipeline from data extraction to index rebuild.

### Lexicon Extraction and Preprocessing

Leverage the structural parsing capabilities of **`companynameparser`** to strip region names and industry suffixes, and use **`jieba`** for semantic validation to ensure core brand name integrity:

- **Extraction Logic**: Traverse all `buyer_unique_name` values via a Python script, extracting the core `brand` field.
- **Weight Compensation**: For words prone to fragmentation (e.g., those containing "元", "一", "三"), manually boost TF (term frequency weight) to **50.0–60.0** to ensure their priority overrides built-in quantifier rules.
- **Output Specification**: Produce SCWS-compliant 4-field `UTF-8` text (WORD, TF, IDF, ATTR). Use tab `\t` separators to avoid parsing anomalies.

### Lexicon Compilation and Deployment

:::tip
Users compiling `xdb` binary dictionary files on Windows can visit OnixByte’s [GitHub](https://github.com/onixbyte/scws/releases/tag/1.2.3) or [GitLab](https://git.onixbyte.cn/onixbyte/scws/-/releases/1.2.3) pages to download the native scws command-line tool for Windows, pre-compiled using MingW.
:::

Convert the text dictionary to SCWS's efficient binary format (XDB):

1. **Compile the Binary Dictionary**:

   ```bash
   # Use scws-gen-dict to generate an encrypted binary lexicon
   /usr/local/scws/bin/scws-gen-dict -i custom_company.txt -o /usr/local/scws/etc/custom_company.xdb -c utf8
   ```

2. **File Distribution and Permissions**: Move the generated `.xdb` file to the tokenisation data directory and ensure the `postgres` user has read permission:

   ```bash
   cp custom_company.xdb /usr/local/pgsql/share/tsearch_data/
   chown postgres:postgres /usr/local/pgsql/share/tsearch_data/custom_company.xdb
   ```

### Database Parameter Configuration

Modify `postgresql.conf` to force-load `zhparser` and its custom extension lexicon:

```plain text
## Preload the extension library (requires restart to take effect)
shared_preload_libraries = 'zhparser'

## Load custom external dictionaries (use paths relative to tsearch_data)
zhparser.extra_dicts = 'custom_company.xdb'
```

### Hot Index Rebuild and Verification

Since tokenisation rules have changed, existing data must be semantically synchronised via index rebuild:

**Physically Restart the Service**:

```bash
su - postgres -c "/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data restart"
```

**Online Index Rebuild**: Use the `CONCURRENTLY` keyword to refresh the GIN index without blocking DML operations on 400K rows:

```bash
REINDEX INDEX CONCURRENTLY index_name;
```

**Tokenisation Effectiveness Verification**:

```sql
-- Expected part-of-speech should show as n (noun), not x (unknown)
SELECT * FROM ts_debug('chinese', '元一能源');
```

**Optimisation Notes:**
- **Explicit Weight Compensation**: This is the key technique that resolved the "元一" tokenisation failure (shown as `x`).
- **Distinguish Restart from Reload**: `shared_preload_libraries` must be activated via `restart`, not a simple reload.