Round Table Discussion November 15, 2022

Open discussions on specific topics selected by the Software Working Group and selected from the list of SWG Topics For Discussion.

Tuesday, November 15, 2022 - Search Engine Indexing - Dipannita Dey

A search engine helps users quickly find information across a large dataset. It involves building efficient indices for fast retrieval of information.

Recording: https://uofi.box.com/s/4f2zucbgayj3bpvrbx2nisz0mun1d9vd

Slides: https://uofi.box.com/s/vesc3e8liw78e1uh4inore72i3kb6f1i

Attendees:

Sandeep Puthanveetil Satheesan

Santiago Nunez-Corrales

Discussion:

There are several text-based search platforms: MongoDB, Apache Solr, Elasticsearch, and Sphinx.

Discussion about which tool works best in various conditions.

Sandeep adds that limited searches work with MongoDB.
Mike Bobak asks whether indexing can be done with MongoDB in the background, which would speed up the search.
Elasticsearch is backed by an active community, so help is available.
Sphinx needs a managed schema.
In-depth talk on Elasticsearch, which is chosen as the tool to use in Clowder.
- Visual analysis on Kibana, but it is a paid subscription
You can talk securely across remote clusters
A description of indexing in Elasticsearch was discussed; it is also customizable.
Max notes that Elasticsearch does not handle hyphens, semicolons, and similar characters well, so a search may not match what you expect it to match.
Clowder Searchbox for V2
If you have a lot of fields, you need to determine which fields you want to search rather than search all of the fields.
Discussion of facets in Clowder.
Two different kinds of mapping - explicit mapping and dynamic mapping.
There are challenges with Elasticsearch
- metadata schema is not always known beforehand. It's best to provide the schema ahead of time.
- Once it has assigned a data type to a field, any incoming metadata that cannot be cast to this type will be rejected.
- Quoting terms - operators supported with quotes are based on Lucene syntax. There are other complex operators like "contains" and wildcards.
  - We want to make searches very user friendly.
  - Parsing user syntax - there is a help page that will help the user determine which syntax to use in their search.
- Permissions in Elasticsearch
  - user permissions are not stored
    - how do we store authorizations?
  - pagination may request a certain number of results. If there are 50 results, but the user only has permission to see 6 of them, V1 will continuously fetch more results until all records are checked.
- Federated search
  - query and results - a federated search will work with Clowder V1 through Clowder Vn through MongoDB
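
The type-assignment challenge noted above (a field's data type is fixed once assigned, and later values that cannot be cast to it are rejected) can be illustrated with a toy model. This is a minimal Python sketch, not Elasticsearch's implementation or API: the class `MappingSketch` and its `index` method are hypothetical names, and casting with Python's built-in constructors is an assumed stand-in for Elasticsearch's own coercion rules.

```python
from typing import Any, Dict, List, Optional


class MappingSketch:
    """Toy model of field typing (hypothetical, not the Elasticsearch API).

    With no explicit mapping, the first value seen for a field fixes its
    type (dynamic mapping). With an explicit mapping, the type is declared
    ahead of time.
    """

    def __init__(self, explicit: Optional[Dict[str, type]] = None) -> None:
        self.mapping: Dict[str, type] = dict(explicit or {})
        self.docs: List[Dict[str, Any]] = []

    def index(self, doc: Dict[str, Any]) -> bool:
        """Store the document, or return False if any field cannot be cast."""
        cast_doc: Dict[str, Any] = {}
        for field, value in doc.items():
            # Unknown fields take the type of the incoming value.
            expected = self.mapping.get(field, type(value))
            try:
                cast_doc[field] = expected(value)
            except (TypeError, ValueError):
                return False  # reject the whole document
        # Only commit new field types once the whole document is accepted.
        for field, value in cast_doc.items():
            self.mapping.setdefault(field, type(value))
        self.docs.append(cast_doc)
        return True


dynamic = MappingSketch()
dynamic.index({"depth": 12})          # "depth" is now typed as int
dynamic.index({"depth": "15"})        # accepted: "15" can be cast to int
dynamic.index({"depth": "shallow"})   # rejected: cannot be cast to int

explicit = MappingSketch({"depth": str})
explicit.index({"depth": 12})         # accepted and stored as "12"
explicit.index({"depth": "shallow"})  # accepted
```

In this sketch, providing the schema ahead of time avoids the rejection, which is the reason given above for preferring a known schema.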

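The pagination issue described under permissions can be sketched in the same way. The following Python example is an assumption about the general pattern, not Clowder V1 code: `fetch_permitted_page`, `can_view`, and `batch_size` are hypothetical names, and the in-memory list stands in for search results returned by the index.

```python
from typing import Callable, List, Sequence


def fetch_permitted_page(
    hits: Sequence[str],
    can_view: Callable[[str], bool],
    page_size: int,
    batch_size: int = 10,
) -> List[str]:
    """Collect up to page_size hits the user may see.

    Because permissions are not stored in the index, each batch has to be
    filtered after it is fetched, and more batches are fetched until the
    page is full or every hit has been checked.
    """
    permitted: List[str] = []
    offset = 0
    while len(permitted) < page_size and offset < len(hits):
        batch = hits[offset:offset + batch_size]  # stands in for one query
        offset += len(batch)
        for hit in batch:
            if can_view(hit):
                permitted.append(hit)
                if len(permitted) == page_size:
                    break
    return permitted


# 50 hits, of which the user may see only 6.
all_hits = [f"doc{i}" for i in range(50)]
visible = {"doc0", "doc9", "doc18", "doc27", "doc36", "doc45"}

page = fetch_permitted_page(all_hits, lambda h: h in visible, page_size=10)
```

With a page size of 10 and only 6 visible hits, the loop checks all 50 records before returning, which is the cost the discussion points out.
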
If you are interested in contributing to a Round Table, please see these links:

Round Table Google Sheet: https://docs.google.com/spreadsheets/d/1kbgO6sIb_4eLugfSVKQNCTXdaKp1R6m0RDczPTsUAoQ/edit#gid=0 Everyone should have edit permission.

Round Table Discussions

SWG Topics For Discussion
