Open discussions on specific topics chosen by the Software Working Group from the list of SWG Topics For Discussion.
Tuesday, November 15, 2022 - Search Engine Indexing - Dipannita Dey
A search engine helps users quickly find information across a large dataset. It involves building efficient indices for fast retrieval of information.
Recording: https://uofi.box.com/s/4f2zucbgayj3bpvrbx2nisz0mun1d9vd
Slides: https://uofi.box.com/s/vesc3e8liw78e1uh4inore72i3kb6f1i
Attendees:
Sandeep Puthanveetil Satheesan
Discussion:
There are several text-based search platforms: MongoDB, Apache Solr, Elasticsearch, and Sphinx.
Discussion about which tool works best in various conditions.
- Sandeep adds that limited searches work with MongoDB.
- Mike Bobak asks whether indexing can be done in the background with MongoDB, which would speed up the search.
- Elasticsearch is backed by an active community so there is help available
- Sphinx needs a managed schema.
- In-depth talk on Elasticsearch, which is the tool chosen for use in Clowder.
- Visual analysis on Kibana, but it is a paid subscription
- You can talk securely across remote clusters
- A description of indexing in Elasticsearch was discussed; it is also customizable.
- Max notes that Elasticsearch does not handle hyphens, semicolons, etc. well and won't catch what you expect it to catch.
- Clowder Searchbox for V2
- If you have a lot of fields, you need to determine which fields you want to search rather than search all of the fields.
- Discussion of facets in Clowder.
- Two different kinds of mapping - explicit mapping and dynamic mapping.
- There are challenges with Elasticsearch
- The metadata schema is not always known beforehand. It's best to provide the schema ahead of time.
- Once it has assigned a data type to a field, any incoming metadata that cannot be cast to that type will be rejected.
- Quoting terms - operators supported with quotes are based on Lucene syntax. There are other complex operators like "contains" and wildcards.
- We want to make searches very user friendly.
- Parsing user syntax - there is a help page that will help the user determine which syntax to use in their search.
- Permissions in Elasticsearch
- user permissions are not stored
- how do we store authorizations?
- pagination may request a certain number of results. If there are 50 results, but the user only has permission to see 6 of them, V1 will continuously fetch more results until all records are checked.
- Federated search
- query and results - a federated search will work with Clowder V1 through Clowder Vn through MongoDB
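The mapping challenge noted above (a field's type is fixed once assigned, and values that cannot be cast to it are rejected) can be illustrated with a toy model. This is a minimal sketch in plain Python, not Elasticsearch's actual inference logic; the `ToyIndex` class and the `depth` field are hypothetical names chosen for illustration.

```python
class ToyIndex:
    """Toy model of field mapping: a field's type is fixed once it is set."""

    def __init__(self, explicit_mapping=None):
        # Explicit mapping: field types supplied ahead of time (assumption:
        # types are modeled as Python callables such as float or str).
        self.mapping = dict(explicit_mapping or {})
        self.docs = []

    def index(self, doc):
        stored = {}
        for field, value in doc.items():
            expected = self.mapping.get(field)
            if expected is None:
                # Dynamic mapping: the first value seen decides the type.
                expected = type(value)
                self.mapping[field] = expected
            try:
                stored[field] = expected(value)
            except (TypeError, ValueError):
                raise ValueError(
                    f"field {field!r} is mapped as {expected.__name__}; "
                    f"cannot cast {value!r}"
                )
        self.docs.append(stored)


# Dynamic mapping: the first document fixes 'depth' as a float.
dynamic = ToyIndex()
dynamic.index({"depth": 12.5})
try:
    dynamic.index({"depth": "shallow"})
    rejected = False
except ValueError:
    rejected = True  # the second document cannot be cast to float

# Explicit mapping: declaring 'depth' as text up front accepts both.
explicit = ToyIndex({"depth": str})
explicit.index({"depth": 12.5})
explicit.index({"depth": "shallow"})
```

In this toy model the order of arrival decides the outcome under dynamic mapping, which is why providing the schema ahead of time is the safer choice when the metadata is heterogeneous.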
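The pagination concern raised under Permissions can be sketched as follows. Because permissions are not stored in the index, each batch of hits has to be filtered after retrieval, and the loop keeps fetching until the page is full or every record has been checked. This is a sketch under stated assumptions, not Clowder's actual implementation; `fetch_permitted_page`, `search_batch`, and `can_view` are hypothetical names.

```python
from typing import Callable, List


def fetch_permitted_page(
    search_batch: Callable[[int, int], List[dict]],
    can_view: Callable[[dict], bool],
    page_size: int,
    batch_size: int = 50,
) -> List[dict]:
    """Collect up to page_size hits the user may see, filtering each batch."""
    visible: List[dict] = []
    offset = 0
    while len(visible) < page_size:
        batch = search_batch(offset, batch_size)
        if not batch:
            break  # index exhausted: every record has been checked
        for hit in batch:
            if can_view(hit):
                visible.append(hit)
                if len(visible) == page_size:
                    break
        offset += len(batch)
    return visible


# 50 matching records, of which the user may see only 6.
hits = [{"id": i} for i in range(50)]
allowed = {3, 8, 15, 22, 30, 44}
calls = []  # offsets requested from the (simulated) search backend


def search_batch(offset, size):
    calls.append(offset)
    return hits[offset:offset + size]


page = fetch_permitted_page(
    search_batch, lambda h: h["id"] in allowed, page_size=10, batch_size=20
)
print([h["id"] for h in page])  # [3, 8, 15, 22, 30, 44]
print(calls)                    # [0, 20, 40, 50]
```

With a page size of 10 and only 6 visible records, the loop scans all 50 hits and issues one extra empty request before giving up, which is the cost the discussion points at. Storing authorizations alongside the indexed documents would let the query itself do the filtering.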
If you are interested in contributing to a Round Table, please see these links:
Round Table Google Sheet: https://docs.google.com/spreadsheets/d/1kbgO6sIb_4eLugfSVKQNCTXdaKp1R6m0RDczPTsUAoQ/edit#gid=0 Everyone should have edit permission.