Open discussions on specific topics chosen by the Software Working Group from the list of SWG Topics For Discussion.

Tuesday, November 15, 2022 - Search Engine Indexing - Dipannita Dey

A search engine helps users quickly find information across a large dataset. It involves building efficient indices for fast retrieval of information.

 


Recording:  https://uofi.box.com/s/4f2zucbgayj3bpvrbx2nisz0mun1d9vd

Slides:  https://uofi.box.com/s/vesc3e8liw78e1uh4inore72i3kb6f1i


Attendees:

Luigi Marini 

Dipannita Dey 

Maxwell Burnette 

Minu Mathew 

Charles Blatti 

Chen Wang 

Christopher Navarro 

Matt Berry 

Michael Bobak 

Mikolaj Kowalik 

Sandeep Puthanveetil Satheesan 

Santiago Nunez-Corrales 

Sara Lambert 

Todd Nicholson 

Bing Zhang

Nathan Tolbert 

Elizabeth Yanello 

Douglas Friedel 


Discussion:

There are several text-based search platforms: MongoDB, Apache Solr, Elasticsearch, and Sphinx.

Discussion about which tool works best in various conditions.

  • Sandeep adds that limited searches work with MongoDB.
  • Mike Bobak asks whether indexing can be done in the background with MongoDB, which would speed up the search.
  • Elasticsearch is backed by an active community, so help is available.
  • Sphinx needs a managed schema.
  • In-depth talk on Elasticsearch, which is the tool chosen for use in Clowder
    • Visual analysis on Kibana, but it is a paid subscription
  • You can talk securely across remote clusters
  • Indexing in Elasticsearch was described; it is also customizable.
  • Max notes that Elasticsearch does not handle hyphens, semicolons, etc. well and won't always catch what you expect it to catch.
  • Clowder Searchbox for V2
  • If you have a lot of fields, you need to determine which fields you want to search rather than search all of the fields.
  • Discussion of facets in Clowder.
  • Two different kinds of mapping - explicit mapping and dynamic mapping.
  • There are challenges with Elasticsearch
    • The metadata schema is not always known beforehand.  It's best to provide the schema ahead of time.
    • Once Elasticsearch has assigned a data type to a field, any incoming metadata that cannot be cast to this type will be rejected.
    • Quoting terms - operators supported with quotes are based on Lucene syntax.  There are other complex operators like "contains" and wildcards.
      • We want to make searches very user friendly.
      • Parsing user syntax - there is a help page that will help the user determine which syntax to use in their search.
    • Permissions in Elasticsearch
      • user permissions are not stored
        • how do we store authorizations?
      • Pagination requests a certain number of results per page.  If there are 50 results, but the user only has permission to see 6 of them, V1 will keep fetching more results until all records have been checked.
    • Federated search
      • query and results - a federated search will work with Clowder V1 through Clowder Vn through MongoDB
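
The mapping challenge noted above (a field's type is fixed once assigned, and values that cannot be cast are rejected) can be illustrated with a toy model. This is only a sketch in plain Python, not the Elasticsearch client API or Clowder code; `ToyIndex` and its methods are hypothetical names invented for illustration.

```python
# Toy model of dynamic vs. explicit mapping. NOT the Elasticsearch API:
# ToyIndex, index(), and the field names are hypothetical illustrations.


class ToyIndex:
    def __init__(self, explicit_mapping=None):
        # field name -> Python type; pre-filled when the schema is provided ahead of time
        self.mapping = dict(explicit_mapping or {})
        self.docs = []

    def index(self, doc):
        new_types = {}
        converted = {}
        for field, value in doc.items():
            # Dynamic mapping: an unseen field takes the type of its first value.
            field_type = self.mapping.get(field, type(value))
            try:
                converted[field] = field_type(value)
            except (TypeError, ValueError) as err:
                # The whole document is rejected and the mapping is left unchanged.
                raise ValueError(
                    f"field {field!r}: cannot cast {value!r} to {field_type.__name__}"
                ) from err
            new_types[field] = field_type
        self.mapping.update(new_types)
        self.docs.append(converted)
        return converted


dynamic = ToyIndex()
dynamic.index({"depth": 10})          # "depth" is now mapped as int
dynamic.index({"depth": "12"})        # accepted: "12" can be cast to int
try:
    dynamic.index({"depth": "shallow"})
except ValueError as err:
    print("rejected:", err)

# With an explicit mapping the type is decided up front instead of by the first document.
explicit = ToyIndex({"depth": str})
explicit.index({"depth": 10})
```

In this sketch, providing the schema ahead of time simply means passing `explicit_mapping`, so the first document to arrive no longer decides the field type.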

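The pagination issue discussed under permissions can be sketched as follows: because user permissions are not stored in the index, each hit has to be checked after it is fetched, and more batches are requested until the page is full or every record has been checked. This is a minimal sketch under those assumptions, not the Clowder V1 implementation; `visible_page`, `fetch`, and `can_view` are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def visible_page(
    fetch: Callable[[int, int], Sequence[T]],  # hypothetical: fetch(offset, size) -> hits
    can_view: Callable[[T], bool],             # hypothetical permission check
    page_size: int,
    batch_size: int = 10,
) -> Tuple[List[T], int]:
    """Return up to page_size viewable hits and the number of records checked."""
    page: List[T] = []
    checked = 0
    while len(page) < page_size:
        batch = fetch(checked, batch_size)
        if not batch:
            break  # index exhausted: every record has been checked
        for hit in batch:
            checked += 1
            if can_view(hit):
                page.append(hit)
                if len(page) == page_size:
                    break
    return page, checked


# 50 results in the index, of which this user may see only 6.
results = list(range(50))
allowed = {0, 9, 18, 27, 36, 45}

page, checked = visible_page(
    fetch=lambda offset, size: results[offset:offset + size],
    can_view=lambda hit: hit in allowed,
    page_size=10,
)
print(page, checked)
```

With a page size of 10 and only 6 viewable records, the loop checks all 50 records before giving up, which is the cost the discussion points to.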





If you are interested in contributing to a Round Table, please see these links:

Round Table Google Sheet: https://docs.google.com/spreadsheets/d/1kbgO6sIb_4eLugfSVKQNCTXdaKp1R6m0RDczPTsUAoQ/edit#gid=0  Everyone should have edit permission.

Round Table Discussions

SWG Topics For Discussion



