Lyra v0.3.0 - What's new
Learn the new features of Lyra v0.3.0, the JavaScript-based full-text search engine.
Lyra just hit a new minor release!
Version 0.3.0 introduces some great new features, as well as general performance enhancements and bug fixes.
It also introduces a new breaking change, so make sure to check it out.
TL;DR
- BREAKING: New returning documents shape. Lyra now returns all the documents inside of { id: string, score: number, document: T }, where document contains the original document.
- Token relevance. Lyra search now ranks results based on search token relevance.
- Semi-schemaless. Lyra now is semi-schemaless; choose which properties you want to index, and avoid describing large, complex schemas if you have non-searchable properties.
- No reserved properties anymore. With older versions of Lyra, property names such as id were reserved. Now you can insert documents containing these properties without any problem.
- Internals. Lyra now exposes its internal methods to allow everyone to write their own integrations quickly.
- General fixes. New performance enhancements and bug fixes.
BREAKING: New returning documents shape
Before Lyra v0.3.0, the search function used to return the following object:
{
elapsed: 300n,
count: 1,
hits: [
{
"id": "35026070-456125",
"foo": "bar",
...
}
]
}
Starting from v0.3.0, Lyra modified the way documents get returned, enriching the information contained in the hits property:
{
elapsed: 300n,
count: 1,
hits: [
{
"id": "35026070-456125",
"score": 0.04449919616347632,
"document": {
"foo": "bar"
...
}
}
]
}
You can now access the full, original document through the new document property of every object in the hits array.
The score property tells you how relevant the search result is.
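If you have code written against the old shape, a small adapter can ease the migration. The snippet below is only a sketch: the sample result copies the values from the example above, and unwrapDocuments is a hypothetical helper name, not part of Lyra.

```javascript
// Sample search result in the v0.3.0 shape (values copied from the example above).
const sampleResult = {
  elapsed: 300n,
  count: 1,
  hits: [
    {
      id: "35026070-456125",
      score: 0.04449919616347632,
      document: { foo: "bar" },
    },
  ],
};

// Hypothetical helper (not part of Lyra): code written before v0.3.0 expected
// every hit to be the document itself, so unwrap the `document` property.
function unwrapDocuments(searchResult) {
  return searchResult.hits.map((hit) => hit.document);
}

const originalDocuments = unwrapDocuments(sampleResult);
console.log(originalDocuments); // [ { foo: 'bar' } ]
```

New code can read hit.id, hit.score and hit.document directly instead of going through an adapter like this one.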
TF-IDF Ranking
One new significant feature is that Lyra now sorts search results by relevance.
Starting from v0.3.0, Lyra implements the TF-IDF ranking algorithm to provide more relevant results while performing any kind of search.
Before v0.3.0, Lyra would sort all the search results by document ID (like in a FIFO queue). Now it sorts everything by relevance.
To explain this concept, let's pretend we have the following four documents:
{"id-01": "The quick brown fox jumps over the lazy dog"}
{"id-02": "I love my dog!"}
{"id-03": "This quick fox is jumping over a giraffe. What a fox!"}
{"id-04": "I love this brown fox. Fox is my favorite animal ever. Give me a fox"}
If we search for "quick fox", for example, we can clearly see that some results might be more appropriate than others. The way we define which document is more relevant is by performing a set of operations that follow the term frequency-inverse document frequency (TF-IDF) algorithm.
Every time you insert new data, Lyra will tokenize the new document, then, for each token, it will:
- Count the number of times the token appears in a particular document
- Calculate the TF value, that is to say, the term count divided by the number of words in a particular document
- Count the number of documents in which the token appears
- Calculate the DF, which is the document count divided by the total number of documents
- Calculate the IDF, the inverse of DF, after which a logarithmic function is applied
Every time a new search operation is performed, Lyra will use the data above to calculate how much a given document is relevant to the search query by assigning it a score.
The results will always be sorted in descending order, from the most relevant to the least relevant.
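The steps above can be sketched in a few lines of JavaScript. This is a minimal illustration and not Lyra's actual implementation: simpleTokenize, buildIndex and rankByTfIdf are hypothetical names, the tokenizer is a naive lowercase letter matcher, the logarithm is assumed to be the natural one, and a document's score is assumed to be the sum of TF × IDF over the query tokens.

```javascript
// The four example documents from above, keyed by id.
const exampleDocs = {
  "id-01": "The quick brown fox jumps over the lazy dog",
  "id-02": "I love my dog!",
  "id-03": "This quick fox is jumping over a giraffe. What a fox!",
  "id-04": "I love this brown fox. Fox is my favorite animal ever. Give me a fox",
};

// Naive tokenizer (assumption): lowercase, keep runs of letters only.
function simpleTokenize(text) {
  return text.toLowerCase().match(/[a-z]+/g) ?? [];
}

// Insert time: compute TF per document and count documents per token.
function buildIndex(docs) {
  const tf = new Map();       // id -> Map(token -> term count / words in document)
  const docCount = new Map(); // token -> number of documents containing the token
  for (const [id, text] of Object.entries(docs)) {
    const tokens = simpleTokenize(text);
    const counts = new Map();
    for (const token of tokens) counts.set(token, (counts.get(token) ?? 0) + 1);
    const frequencies = new Map();
    for (const [token, count] of counts) {
      frequencies.set(token, count / tokens.length);
      docCount.set(token, (docCount.get(token) ?? 0) + 1);
    }
    tf.set(id, frequencies);
  }
  return { tf, docCount, totalDocs: tf.size };
}

// Search time: sum TF * IDF for every query token, then sort descending.
function rankByTfIdf(index, query) {
  const scores = new Map();
  for (const token of simpleTokenize(query)) {
    const docsWithToken = index.docCount.get(token) ?? 0;
    if (docsWithToken === 0) continue; // token unknown to the index
    const df = docsWithToken / index.totalDocs;
    const idf = Math.log(1 / df); // inverse of DF, then a logarithm
    for (const [id, frequencies] of index.tf) {
      const termFrequency = frequencies.get(token) ?? 0;
      if (termFrequency > 0) {
        scores.set(id, (scores.get(id) ?? 0) + termFrequency * idf);
      }
    }
  }
  return [...scores]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

const ranked = rankByTfIdf(buildIndex(exampleDocs), "quick fox");
console.log(ranked.map((hit) => hit.id)); // [ 'id-03', 'id-01', 'id-04' ]
```

Under these assumptions, id-02 is left out because it contains neither query token, and id-04 ranks last because "fox" appears in three of the four documents and therefore carries less weight than "quick".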
Learn more about TF-IDF here: learndatasci.com/glossary/tf-idf-term-frequ..
Semi-schemaless
When working on large datasets, it is common to have documents with a large number of properties, and maybe some of them are not even relevant for any search purpose.
Also, consider that currently Lyra, including v0.3.0, only performs search operations on strings.
With that being said, let's consider the following schema:
import { create } from '@lyrasearch/lyra'
const db = create({
schema: {
author: 'string',
quote: 'string',
favorite: 'boolean', // <-- unsearchable
tags: 'string[]' // <-- unsupported type!
}
})
Why does Lyra need to know that a given property is of a certain type if it is not searchable?
The main reason for Lyra to know types is that we're experimenting with the possibility of performing filtering operations depending on booleans, numbers, etc.
Starting from v0.3.0, it will no longer be necessary to list any non-searchable property as part of the Lyra schema.
In fact, it will be possible to rewrite the schema definition above as follows:
import { create } from '@lyrasearch/lyra'
const db = create({
schema: {
author: 'string',
quote: 'string',
}
})
and still be able to insert documents like:
{
"author": "Rumi",
"quote": "Patience is the key to joy",
"isFavorite": true,
"tags": ["inspirational", "deep"]
}
or even documents with different shapes:
[
{
"author": "Rumi",
"quote": "Patience is the key to joy",
"isFavorite": true,
"tags": ["inspirational", "deep"]
},
{
"author": "Rumi",
"quote": "Grace comes to forgive and then forgive again",
"score": 10,
"link": null
}
]
Of course, it will only be possible to perform search operations on known properties, in this case author and quote, which will always need to be of type string (as stated during the schema definition).
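To make the idea concrete, here is a rough sketch of how a semi-schemaless insert could separate searchable properties from the rest. It is not Lyra's code: pickSearchableFields is a hypothetical helper, and the sketch assumes that every property listed in the schema must be present on the document with the declared type.

```javascript
// Schema from the example above: only these properties are searchable.
const quoteSchema = { author: "string", quote: "string" };

// Hypothetical helper (not part of Lyra): keep only the properties named in
// the schema and check that each one has the declared type. Everything else
// on the document is ignored for indexing purposes.
function pickSearchableFields(schema, doc) {
  const fields = {};
  for (const [property, expectedType] of Object.entries(schema)) {
    const value = doc[property];
    if (typeof value !== expectedType) {
      throw new TypeError(
        `Property "${property}" must be of type ${expectedType}`
      );
    }
    fields[property] = value;
  }
  return fields;
}

const searchable = pickSearchableFields(quoteSchema, {
  author: "Rumi",
  quote: "Patience is the key to joy",
  isFavorite: true,
  tags: ["inspirational", "deep"],
});
console.log(searchable);
// { author: 'Rumi', quote: 'Patience is the key to joy' }
```

The extra properties (isFavorite, tags) never reach the index in this sketch, yet they remain part of the original document that a search hit hands back.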
No reserved properties anymore
Before Lyra v0.3.0, some property names such as id were forbidden. For instance, inserting the following document would have caused an error:
{
"id": "12939123", // <--- "id" was a reserved property name
"foo": "bar",
"favorite": true
}
Starting with v0.3.0, there will be no forbidden properties, so the document above will be totally fine.
Internals
Extending Lyra has just become easier than ever.
Starting from v0.3.0, Lyra exposes some of its internals:
import {
formatNanoseconds,
getNanosecondsTime,
intersectTokenScores,
includes,
boundedLevenshtein,
tokenize
} from '@lyrasearch/lyra/dist/esm/internals'
Every exposed method comes with its own type definition.
Let's break them down:
formatNanoseconds: takes a BigInt as input and returns a human-readable string.
import { formatNanoseconds } from '@lyrasearch/lyra/dist/esm/internals'
formatNanoseconds(30000n) // "30μs"
getNanosecondsTime: gets the current time with nanoseconds-precision. Returns a BigInt.
import { getNanosecondsTime } from '@lyrasearch/lyra/dist/esm/internals'
getNanosecondsTime() // 1363500821581208n
intersectTokenScores: returns the intersection of N arrays.
import { intersectTokenScores } from '@lyrasearch/lyra/dist/esm/internals'
intersectTokenScores([
[
["foo", 1],
["bar", 1],
["baz", 2],
],
[
["foo", 4],
["quick", 10],
["brown", 3],
["bar", 2],
],
[
["fox", 12],
["foo", 4],
["jumps", 3],
["bar", 6],
],
])
// Result: [["foo", 9], ["bar", 9]]
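The result above suggests that the intersection keeps only the tokens present in every array and adds up their scores. A minimal sketch of that behavior follows; intersectAndSumScores is a hypothetical stand-in rather than Lyra's implementation, and it assumes each token appears at most once per array.

```javascript
// Hypothetical stand-in (not Lyra's code): keep the tokens found in every
// array and sum their scores, preserving the order of the first array.
function intersectAndSumScores(arrays) {
  if (arrays.length === 0) return [];
  const [first, ...rest] = arrays;
  // Turn the remaining arrays into Maps for constant-time lookups.
  const restMaps = rest.map((entries) => new Map(entries));
  const result = [];
  for (const [token, score] of first) {
    if (restMaps.every((map) => map.has(token))) {
      const total = restMaps.reduce((sum, map) => sum + map.get(token), score);
      result.push([token, total]);
    }
  }
  return result;
}

const summed = intersectAndSumScores([
  [["foo", 1], ["bar", 1], ["baz", 2]],
  [["foo", 4], ["quick", 10], ["brown", 3], ["bar", 2]],
  [["fox", 12], ["foo", 4], ["jumps", 3], ["bar", 6]],
]);
console.log(summed); // [ [ 'foo', 9 ], [ 'bar', 9 ] ]
```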
includes: faster alternative to Array.prototype.includes.
import { includes } from '@lyrasearch/lyra/dist/esm/internals'
includes([10,20,30], 10) // true
boundedLevenshtein: Computes the Levenshtein distance between two strings (a, b), returning early with -1 if the distance is greater than the given tolerance. It assumes that tolerance >= ||a| - |b|| >= 0.
import { boundedLevenshtein } from '@lyrasearch/lyra/dist/esm/internals'
boundedLevenshtein("moon", "lions", 3) // { isBounded: true, distance: 3 }
tokenize: tokenizes an input string:
import { tokenize } from '@lyrasearch/lyra/dist/esm/internals'
tokenize("hello, world!") // ["hello", "world"]
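To give an idea of what "returning early" means for boundedLevenshtein, here is a self-contained sketch of a tolerance-bounded Levenshtein distance. It is an illustration only, not Lyra's implementation: boundedDistanceSketch is a hypothetical name, and the return shape (isBounded plus a distance of -1 when the tolerance is exceeded) is an assumption based on the description and example above.

```javascript
// Hypothetical sketch (not Lyra's code): classic row-by-row Levenshtein
// that gives up as soon as the distance can no longer fit the tolerance.
function boundedDistanceSketch(a, b, tolerance) {
  const exceeded = { isBounded: false, distance: -1 };
  // The length difference is a lower bound on the edit distance.
  if (Math.abs(a.length - b.length) > tolerance) return exceeded;

  // previous[j] = distance between the first i-1 chars of a and first j of b.
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const current = [i];
    let rowMin = i;
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      current[j] = Math.min(
        previous[j] + 1,       // deletion
        current[j - 1] + 1,    // insertion
        previous[j - 1] + cost // substitution or match
      );
      rowMin = Math.min(rowMin, current[j]);
    }
    // Row minimums never decrease, so the final distance cannot drop below
    // this one: stop early once it is over the tolerance.
    if (rowMin > tolerance) return exceeded;
    previous = current;
  }
  const distance = previous[b.length];
  return distance <= tolerance ? { isBounded: true, distance } : exceeded;
}

console.log(boundedDistanceSketch("moon", "lions", 3));
// { isBounded: true, distance: 3 }
console.log(boundedDistanceSketch("moon", "lions", 2));
// { isBounded: false, distance: -1 }
```

The early exit is what makes a bounded variant attractive for typo-tolerant search: most candidate terms are far from the query, so they can be discarded after a few rows instead of filling the whole matrix.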
General fixes
With #167 and #166, we introduced a good number of performance optimizations and cleaned up the code.
Acknowledgements
This release has been made possible by:
- Michele Riva - @MicheleRivaCode
- Paolo Insogna - @p_insogna
- Matteo Pietro Dazzi - @ilteoood
- Daniele Lubrano - @LBRDaniele
A special "thank you" to NearForm for sponsoring Lyra.