Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: 1.9.0
Component/s: None
Labels: None

Description

https://crosswire.org/pipermail/sword-devel/2026-January/050930.html
https://github.com/crosswire/xiphos/issues/1262

Bug 1 — CLucene: single `+TERM` search returns no results (trailing space workaround)

Description

When performing a Lucene indexed search (`searchType == -4`) with a single term
preceded by `+` (standard Lucene AND syntax), the search returns no results for
certain words. Adding a trailing space to the search string resolves the issue.

Steps to reproduce

```
diatheke -b NET -s lucene -k +prophetess
→ none (NET)

diatheke -b NET -s lucene -k '+prophetess '
→ 8 matches (NET)
```

Not all words are affected. Words like "Jesus" work regardless, but a subset of
words consistently fail with `+TERM` and succeed with `+TERM ` (trailing space).

Impact

Xiphos sidebar search prefixes each term with `+` to perform an implicit AND
search. With a single term, this produces `+TERM` which silently fails for
affected words. Xiphos has applied a workaround (adding a trailing space before
calling SWORD), but the root cause is in SWORD's CLucene integration.

This also affects Greek LXX word searches (see Bug 2).
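
To make the failing input concrete, here is a minimal sketch of how a front end might build such a query. The function name `buildAndQuery` is hypothetical and is not taken from the Xiphos source; it only illustrates why a one-word search ends up as `+TERM` with nothing after the term.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not Xiphos code): prefix every whitespace-separated
// term with '+' so that all terms are required (implicit AND).
std::string buildAndQuery(const std::string &input) {
    std::istringstream in(input);
    std::string term;
    std::string query;
    while (in >> term) {
        if (!query.empty()) query += ' ';  // single space between terms
        query += '+';
        query += term;
    }
    return query;  // note: no trailing space after the last term
}
```

With a single word such as `prophetess`, the result is `+prophetess`, which is exactly the form that fails in the reproduction above.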

Root cause

The CLucene `StandardAnalyzer` tokenizer does not finalize the last token
correctly when the input string ends without trailing whitespace. The token is
effectively dropped during query parsing.

Fix

In `SWModule::search()`, append a trailing space to the search string before
passing it to `QueryParser::parse()` in the CLucene code path:

```cpp
// Append a trailing space to work around a CLucene tokenizer bug where
// the last token is not finalized correctly without a following whitespace.
// This fixes: single +TERM searches, and some Unicode words (e.g. Greek LXX).
SWBuf istrFixed = istr;
istrFixed.append(' ');
q = QueryParser::parse(
    (wchar_t *)utf8ToWChar(istrFixed.c_str()).getRawData(),
    _T("content"), &analyzer);
```

Note on Xiphos workaround

Once this fix is integrated into SWORD, Xiphos may remove its own trailing-space
workaround (issue #1262). The two trailing spaces that would result in the
interim (SWORD fix + Xiphos workaround both active) are harmless as CLucene
ignores multiple trailing spaces.
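
If the doubled space is nevertheless unwanted, the append could be made idempotent. The following is only a sketch using `std::string`; `ensureTrailingSpace` is a hypothetical name and is not part of the attached patches.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: append a space unless the query is empty or
// already ends in a whitespace character.
std::string ensureTrailingSpace(const std::string &query) {
    if (query.empty()) return query;
    // Cast to unsigned char: passing a negative char to isspace is undefined.
    unsigned char last = static_cast<unsigned char>(query.back());
    if (std::isspace(last)) return query;
    return query + ' ';
}
```

Applied to `+prophetess` this yields `+prophetess `, and applying it again leaves the string unchanged, so it would not matter whether the caller has already added its own space.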

Bug 3 — Phrase and multi-word search: French typographic apostrophe not matched

Description

French Bible modules use the typographic apostrophe `’` (U+2019, UTF-8:
`0xE2 0x80 0x99`) in their text, for example in words like `l’Éternel` or
`n’est`. Users typing on a standard keyboard produce the straight apostrophe
`'` (U+0027). As a result, searches for French words containing apostrophes
return no results.

Steps to reproduce

In a French Bible module (e.g. LSG), search for:
```
l'Éternel → no results (straight apostrophe from keyboard)
l’Éternel → results found (typographic apostrophe copy-pasted from text)
```

Affected search types

`searchType == -1` (phrase search)
`searchType == -2` (multi-word search)

Lucene indexed search (`searchType == -4`) is not affected as CLucene handles
normalization internally during indexing.
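
The mismatch is visible at the byte level. The sketch below is not SWORD's matching code; it uses a plain byte-wise substring search and an illustrative verse fragment to show that the two apostrophes never compare equal.

```cpp
#include <string>

// Byte-wise substring test, with no Unicode normalization of any kind.
bool containsBytes(const std::string &haystack, const std::string &needle) {
    return haystack.find(needle) != std::string::npos;
}

// Illustrative fragment as stored in a module: the apostrophe is U+2019
// (E2 80 99) and the capital E-acute is U+00C9 (C3 89).
const std::string verseText = "l\xe2\x80\x99\xc3\x89ternel est mon berger";

// What a keyboard produces: straight apostrophe U+0027 (a single byte, 27).
const std::string typedTerm = "l'\xc3\x89ternel";

// The same word copy-pasted from the module text.
const std::string pastedTerm = "l\xe2\x80\x99\xc3\x89ternel";
```

Here `containsBytes(verseText, typedTerm)` is false while `containsBytes(verseText, pastedTerm)` is true, which mirrors the two outcomes in the reproduction steps.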

Fix

Three changes in `SWModule::search()`:

1. Normalize the search term at entry (covers all non-regex search types):

```cpp
SWBuf term = istr;
// Normalize typographic apostrophe (U+2019, UTF-8: 0xE2 0x80 0x99) to standard apostrophe
// so that French searches work regardless of which apostrophe the user types
{
    std::string normalizedTerm = term.c_str();
    size_t pos = 0;
    while ((pos = normalizedTerm.find("\xe2\x80\x99", pos)) != std::string::npos) {
        normalizedTerm.replace(pos, 3, "'");
        pos += 1;
    }
    term = normalizedTerm.c_str();
}
```

2. Normalize verse text in `case -1` (phrase search):

```cpp
textBuf = stripText();
// Normalize typographic apostrophe (U+2019) in verse text to match normalized search term
{
    std::string normalizedBuf = textBuf.c_str();
    size_t pos = 0;
    while ((pos = normalizedBuf.find("\xe2\x80\x99", pos)) != std::string::npos) {
        normalizedBuf.replace(pos, 3, "'");
        pos += 1;
    }
    textBuf = normalizedBuf.c_str();
}
```

3. Normalize verse text in `case -2` (multi-word search):

```cpp
textBuf = getRawEntry();
// Normalize typographic apostrophe (U+2019) in verse text to match normalized search term
{
    std::string normalizedBuf = textBuf.c_str();
    size_t pos = 0;
    while ((pos = normalizedBuf.find("\xe2\x80\x99", pos)) != std::string::npos) {
        normalizedBuf.replace(pos, 3, "'");
        pos += 1;
    }
    textBuf = normalizedBuf.c_str();
}
```
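
The three blocks above repeat the same loop, so they could share one helper. This is a sketch only: `normalizeApostrophes` is a hypothetical name, and it works on `std::string` as the patch fragments already do.

```cpp
#include <string>

// Hypothetical helper: replace every typographic apostrophe
// (U+2019, UTF-8: E2 80 99) with the straight apostrophe (U+0027).
std::string normalizeApostrophes(std::string text) {
    static const std::string typographic = "\xe2\x80\x99";
    std::string::size_type pos = 0;
    while ((pos = text.find(typographic, pos)) != std::string::npos) {
        text.replace(pos, typographic.size(), "'");
        pos += 1;  // resume the search just after the inserted apostrophe
    }
    return text;
}
```

Each call site would then reduce to one line, for example `term = normalizeApostrophes(term.c_str()).c_str();`, assuming the `SWBuf` assignment from `const char *` that the fragments above already rely on.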

Design decision

Only the right typographic apostrophe (U+2019, right single quotation mark) is
normalized. The left single quotation mark (U+2018) and the backtick (U+0060)
are intentionally excluded as they do not appear in French Bible text.

Attachments

- swmodule_search_fixes.patch (3 kB, uploaded 07/Mar/26 2:54 PM by Cyrille)
- sword_search_fixes_v2.patch (8 kB, uploaded 15/Mar/26 1:38 PM by Cyrille)
- sword_search_fixes_v3.patch (8 kB, uploaded 15/Mar/26 1:38 PM by Cyrille)

Some searches fail
