Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: None
- Fix Version/s: 1.9.0
- Component/s: None
- Labels: None
Description
https://crosswire.org/pipermail/sword-devel/2026-January/050930.html
https://github.com/crosswire/xiphos/issues/1262
Bug 1 — CLucene: single `+TERM` search returns no results (trailing space workaround)
Description
When performing a Lucene indexed search (`searchType == -4`) with a single term
preceded by `+` (standard Lucene AND syntax), the search returns no results for
certain words. Adding a trailing space to the search string resolves the issue.
Steps to reproduce
```
diatheke -b NET -s lucene -k +prophetess
→ none (NET)
diatheke -b NET -s lucene -k '+prophetess '
→ 8 matches (NET)
```
Not all words are affected. Words like "Jesus" work regardless, but a subset of
words consistently fail with `+TERM` and succeed with `+TERM ` (trailing space).
Impact
Xiphos sidebar search prefixes each term with `+` to perform an implicit AND
search. With a single term, this produces `+TERM` which silently fails for
affected words. Xiphos has applied a workaround (adding a trailing space before
calling SWORD), but the root cause is in SWORD's CLucene integration.
This also affects Greek LXX word searches (see Bug 2).
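To illustrate how a single-term query ends up without trailing whitespace, here is a minimal sketch of the prefixing step described above. `buildAndQuery` is a hypothetical name used only for illustration; the actual Xiphos code may be structured differently.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not the actual Xiphos code): prefix every
// whitespace-separated term with '+' so that Lucene treats each term
// as required (implicit AND).
std::string buildAndQuery(const std::string &input) {
    std::istringstream in(input);
    std::string term;
    std::string query;
    while (in >> term) {
        if (!query.empty()) {
            query += ' ';   // separator between terms only, never at the end
        }
        query += '+';
        query += term;
    }
    return query;
}
```

With one term the result is `+prophetess`, which ends directly on the last character of the term. With several terms, only the final term lacks a following space.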
Root cause
The CLucene `StandardAnalyzer` tokenizer does not finalize the last token
correctly when the input string ends without trailing whitespace. The token is
effectively dropped during query parsing.
Fix
In `SWModule::search()`, append a trailing space to the search string before
passing it to `QueryParser::parse()` in the CLucene code path:
```cpp
// Append a trailing space to work around a CLucene tokenizer bug where
// the last token is not finalized correctly without a following whitespace.
// This fixes: single +TERM searches, and some Unicode words (e.g. Greek LXX).
SWBuf istrFixed = istr;
istrFixed.append(' ');
q = QueryParser::parse((wchar_t *)utf8ToWChar(istrFixed.c_str()).getRawData(), _T("content"), &analyzer);
```
Note on Xiphos workaround
Once this fix is integrated into SWORD, Xiphos may remove its own trailing-space
workaround (issue #1262). The two trailing spaces that would result in the
interim (SWORD fix + Xiphos workaround both active) are harmless as CLucene
ignores multiple trailing spaces.
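If doubled trailing spaces are nevertheless considered untidy, the append could be guarded so that it is idempotent. The following is only a sketch under assumptions: it uses `std::string` instead of `SWBuf` to stay self-contained, and `ensureTrailingSpace` is a hypothetical name, not an existing SWORD function.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: return the query with exactly one guaranteed
// trailing whitespace character, without adding a second one if the
// caller (e.g. a frontend workaround) already supplied it.
std::string ensureTrailingSpace(const std::string &query) {
    if (query.empty()) {
        return query;   // nothing to search for, leave untouched
    }
    // Cast to unsigned char: passing a negative char to std::isspace is
    // undefined behavior, and UTF-8 continuation bytes are >= 0x80.
    const unsigned char last = static_cast<unsigned char>(query.back());
    if (std::isspace(last)) {
        return query;
    }
    return query + ' ';
}
```

Calling it on `+prophetess` and on `+prophetess ` yields the same string, so the SWORD fix and the Xiphos workaround could coexist without stacking spaces.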
Bug 3 — Phrase and multi-word search: French typographic apostrophe not matched
Description
French Bible modules use the typographic apostrophe `’` (U+2019, UTF-8:
`0xE2 0x80 0x99`) in their text, for example in words like `l’Éternel` or
`n’est`. Users typing on a standard keyboard produce the straight apostrophe
`'` (U+0027). As a result, searches for French words containing apostrophes
return no results.
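The mismatch is purely at the byte level, which a short sketch can make concrete. The sample strings and the `contains` helper below are illustrative assumptions, not code taken from SWORD.

```cpp
#include <string>

// Illustrative sample data (not from any module): the same word as it
// appears in module text (U+2019) and as typed on a keyboard (U+0027).
const std::string kVerseText = "l\xe2\x80\x99\xc3\x89ternel";  // typographic apostrophe
const std::string kTypedTerm = "l'\xc3\x89ternel";             // straight apostrophe

// Plain byte-wise substring test, similar in spirit to a phrase search.
bool contains(const std::string &haystack, const std::string &needle) {
    return haystack.find(needle) != std::string::npos;
}
```

Here `contains(kVerseText, kTypedTerm)` is false: the typed term carries the single byte `0x27` where the verse text carries the three bytes `0xE2 0x80 0x99`, so no byte-wise comparison can match them without normalization.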
Steps to reproduce
In a French Bible module (e.g. LSG), search for:
```
l'Éternel → no results (straight apostrophe from keyboard)
l’Éternel → results found (typographic apostrophe copy-pasted from text)
```
Affected search types
- `searchType == -1` (phrase search)
- `searchType == -2` (multi-word search)
Lucene indexed search (`searchType == -4`) is not affected as CLucene handles
normalization internally during indexing.
Fix
Three changes in `SWModule::search()`:
1. Normalize the search term at entry (covers all non-regex search types):
```cpp
SWBuf term = istr;
// Normalize typographic apostrophe (U+2019, UTF-8: 0xE2 0x80 0x99) to standard apostrophe
// so that French searches work regardless of which apostrophe the user types
{
std::string normalizedTerm = term.c_str();
size_t pos = 0;
while ((pos = normalizedTerm.find("\xe2\x80\x99", pos)) != std::string::npos) { normalizedTerm.replace(pos, 3, "'"); pos += 1; }
term = normalizedTerm.c_str();
}
```
2. Normalize verse text in `case -1` (phrase search):
```cpp
textBuf = stripText();
// Normalize typographic apostrophe (U+2019) in verse text to match normalized search term
{
std::string normalizedBuf = textBuf.c_str();
size_t pos = 0;
while ((pos = normalizedBuf.find("\xe2\x80\x99", pos)) != std::string::npos) { normalizedBuf.replace(pos, 3, "'"); pos += 1; }
textBuf = normalizedBuf.c_str();
}
```
3. Normalize verse text in `case -2` (multi-word search):
```cpp
textBuf = getRawEntry();
// Normalize typographic apostrophe (U+2019) in verse text to match normalized search term
{
std::string normalizedBuf = textBuf.c_str();
size_t pos = 0;
while ((pos = normalizedBuf.find("\xe2\x80\x99", pos)) != std::string::npos) { normalizedBuf.replace(pos, 3, "'"); pos += 1; }
textBuf = normalizedBuf.c_str();
}
```
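Since the same loop appears in all three places, it could be factored into one helper. The following is a minimal sketch, assuming a free function operating on `std::string`; the name `normalizeApostrophes` is hypothetical and not part of the SWORD API.

```cpp
#include <string>

// Hypothetical helper: replace every typographic apostrophe
// (U+2019, UTF-8 bytes E2 80 99) with the straight apostrophe (U+0027).
std::string normalizeApostrophes(std::string text) {
    static const std::string kTypographic = "\xe2\x80\x99";
    std::string::size_type pos = 0;
    while ((pos = text.find(kTypographic, pos)) != std::string::npos) {
        text.replace(pos, kTypographic.size(), "'");
        pos += 1;   // continue after the single byte just written
    }
    return text;
}
```

Each of the three call sites would then reduce to a single line, for example `textBuf = normalizeApostrophes(textBuf.c_str()).c_str();`, which keeps the search term and the verse text normalized by exactly the same rule.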
Design decision
Only the right typographic apostrophe (U+2019) is normalized. The left
typographic apostrophe (U+2018) and backtick (U+0060) are intentionally excluded
as they do not appear in French Bible text.