Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1.9.0
    • Component/s: None
    • Labels:
      None

      Description

      https://crosswire.org/pipermail/sword-devel/2026-January/050930.html
      https://github.com/crosswire/xiphos/issues/1262

      Bug 1 — CLucene: single `+TERM` search returns no results (trailing space workaround)

      Description

      When performing a Lucene indexed search (`searchType == -4`) with a single term
      preceded by `+` (standard Lucene AND syntax), the search returns no results for
      certain words. Adding a trailing space to the search string resolves the issue.

      Steps to reproduce

      ```
      diatheke -b NET -s lucene -k +prophetess
      → none (NET)

      diatheke -b NET -s lucene -k '+prophetess '
      → 8 matches (NET)
      ```

      Not all words are affected. Words like "Jesus" work regardless, but a subset of
      words consistently fail with `+TERM` and succeed with `+TERM ` (trailing space).

      Impact

      The Xiphos sidebar search prefixes each term with `+` to perform an implicit
      AND search. With a single term, this produces `+TERM`, which silently returns
      no results for affected words. Xiphos has applied a workaround (adding a
      trailing space before calling SWORD), but the root cause is in SWORD's CLucene
      integration.

      This also affects Greek LXX word searches (see Bug 2).

      Root cause

      The CLucene `StandardAnalyzer` tokenizer does not finalize the last token
      correctly when the input string ends without trailing whitespace. The token is
      effectively dropped during query parsing.

      Fix

      In `SWModule::search()`, append a trailing space to the search string before
      passing it to `QueryParser::parse()` in the CLucene code path:

      ```cpp
      // Append a trailing space to work around a CLucene tokenizer bug where
      // the last token is not finalized correctly without a following whitespace.
      // This fixes: single +TERM searches, and some Unicode words (e.g. Greek LXX).
      SWBuf istrFixed = istr;
      istrFixed.append(' ');
      q = QueryParser::parse(
          (wchar_t *)utf8ToWChar(istrFixed.c_str()).getRawData(),
          _T("content"), &analyzer);
      ```

      Note on Xiphos workaround

      Once this fix is integrated into SWORD, Xiphos may remove its own trailing-space
      workaround (issue #1262). The two trailing spaces that would result in the
      interim (SWORD fix + Xiphos workaround both active) are harmless as CLucene
      ignores multiple trailing spaces.
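
      If the doubled space is undesirable, the append can be made idempotent. The
      following is only a minimal sketch using `std::string` rather than `SWBuf`;
      the helper name `ensureTrailingSpace` is hypothetical and does not exist in
      SWORD or Xiphos:

      ```cpp
      #include <string>

      // Hypothetical helper (not part of SWORD): return `query` with one trailing
      // space appended, unless it is empty or already ends in whitespace.
      std::string ensureTrailingSpace(const std::string &query) {
          if (query.empty()) {
              return query;  // nothing to tokenize, leave untouched
          }
          const char last = query.back();
          if (last == ' ' || last == '\t' || last == '\n' || last == '\r') {
              return query;  // caller (e.g. the Xiphos workaround) already added one
          }
          return query + ' ';
      }
      ```

      With this shape, `+prophetess` becomes `+prophetess ` and a string that
      already carries the Xiphos trailing space is passed through unchanged.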

      Bug 3 — Phrase and multi-word search: French typographic apostrophe not matched

      Description

      French Bible modules use the typographic apostrophe `’` (U+2019, UTF-8:
      `0xE2 0x80 0x99`) in their text, for example in words like `l’Éternel` or
      `n’est`. Users typing on a standard keyboard produce the straight apostrophe
      `'` (U+0027). As a result, searches for French words typed with the straight
      apostrophe return no results.

      Steps to reproduce

      In a French Bible module (e.g. LSG), search for:
      ```
      l'Éternel → no results (straight apostrophe from keyboard)
      l’Éternel → results found (typographic apostrophe copy-pasted from text)
      ```

      Affected search types

      • `searchType == -1` (phrase search)
      • `searchType == -2` (multi-word search)

      Lucene indexed search (`searchType == -4`) is not affected as CLucene handles
      normalization internally during indexing.

      Fix

      Three changes in `SWModule::search()`:

      1. Normalize the search term at entry (covers all non-regex search types):

      ```cpp
      SWBuf term = istr;
      // Normalize typographic apostrophe (U+2019, UTF-8: 0xE2 0x80 0x99) to standard apostrophe
      // so that French searches work regardless of which apostrophe the user types
      {
          std::string normalizedTerm = term.c_str();
          size_t pos = 0;
          while ((pos = normalizedTerm.find("\xe2\x80\x99", pos)) != std::string::npos) {
              normalizedTerm.replace(pos, 3, "'");  // 3 bytes in, 1 byte out
              pos += 1;                             // continue after the inserted apostrophe
          }
          term = normalizedTerm.c_str();
      }
      ```

      2. Normalize verse text in `case -1` (phrase search):

      ```cpp
      textBuf = stripText();
      // Normalize typographic apostrophe (U+2019) in verse text to match normalized search term
      {
          std::string normalizedBuf = textBuf.c_str();
          size_t pos = 0;
          while ((pos = normalizedBuf.find("\xe2\x80\x99", pos)) != std::string::npos) {
              normalizedBuf.replace(pos, 3, "'");
              pos += 1;
          }
          textBuf = normalizedBuf.c_str();
      }
      ```

      3. Normalize verse text in `case -2` (multi-word search):

      ```cpp
      textBuf = getRawEntry();
      // Normalize typographic apostrophe (U+2019) in verse text to match normalized search term
      {
          std::string normalizedBuf = textBuf.c_str();
          size_t pos = 0;
          while ((pos = normalizedBuf.find("\xe2\x80\x99", pos)) != std::string::npos) {
              normalizedBuf.replace(pos, 3, "'");
              pos += 1;
          }
          textBuf = normalizedBuf.c_str();
      }
      ```
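
      The three blocks above repeat the same loop. One possible way to avoid the
      duplication is a small shared helper, sketched here with `std::string` only.
      The name `normalizeApostrophes` is hypothetical and is not an existing SWORD
      function; where it would live in the SWORD source tree is left open:

      ```cpp
      #include <string>

      // Hypothetical helper (not part of SWORD): replace every UTF-8 encoded
      // U+2019 (bytes E2 80 99) in `text` with the ASCII apostrophe U+0027.
      std::string normalizeApostrophes(std::string text) {
          static const std::string kTypographic = "\xe2\x80\x99";
          std::string::size_type pos = 0;
          while ((pos = text.find(kTypographic, pos)) != std::string::npos) {
              text.replace(pos, kTypographic.size(), "'");
              pos += 1;  // continue after the inserted apostrophe
          }
          return text;
      }
      ```

      Each call site would then reduce to something like
      `term = normalizeApostrophes(term.c_str()).c_str();`, assuming the `SWBuf`
      assignment from `const char *` already used in the blocks above.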

      Design decision

      Only the right typographic apostrophe (U+2019) is normalized. The left
      typographic apostrophe (U+2018) and backtick (U+0060) are intentionally excluded
      as they do not appear in French Bible text.

            People

            • Assignee:
              Unassigned
              Reporter:
              lafricain Cyrille