Target Case Study – Document Search

Goal

The goal of this exercise is to create a working program that searches a set of documents for a given search term or phrase (a single token) and returns results in order of relevance. Relevance is defined as the number of times the exact term or phrase appears in the document.

Create three methods for searching the documents:

  • Simple string matching
  • Text search using regular expressions
  • Preprocess the content and then search the index
Three files have been provided for you to read and use as sample search content.
Run a performance test that does 2M searches with random search terms and measures execution time. Which approach is fastest? Why?
Provide some thoughts on what you would do on the software or hardware side to make this program scale to handle massive content and/or very large request volume (5000 requests/second or more).

Prompt the user to enter a search term and search method, execute the search, and return results. For instance:

    Enter the search term: <user enters search term>
    Search Method: 1) String Match 2) Regular Expression 3) Indexed
    Search results: 
        File2.txt – X matches
        File1.txt - X matches
        File3.txt – X matches
    Elapsed time: 40 ms
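
Below is a minimal sketch of how that interactive loop can be wired together (imports omitted, as in the other snippets). The three engines it dispatches to (StringMatch, RegularExpression, and the IndexedMatch lookup sketched later in this write-up) are described in the sections that follow; input validation and sorting of the results by match count are left out.

// Sketch of the prompt/dispatch loop described above (illustrative only).
func runPrompt(files []FileContent) {
	reader := bufio.NewReader(os.Stdin)

	fmt.Print("Enter the search term: ")
	term, _ := reader.ReadString('\n')
	term = strings.TrimSpace(term)

	fmt.Print("Search Method: 1) String Match 2) Regular Expression 3) Indexed: ")
	choice, _ := reader.ReadString('\n')
	choice = strings.TrimSpace(choice)

	start := time.Now()
	fmt.Println("Search results:")
	for _, f := range files {
		var matches int
		switch choice {
		case "2":
			matches = RegularExpression(f.Lines, term)
		case "3":
			matches = IndexedMatch(term, f.Name) // indexed lookup, sketched later
		default:
			matches = StringMatch(f.Lines, term)
		}
		fmt.Printf("    %s - %d matches\n", f.Name, matches)
	}
	fmt.Printf("Elapsed time: %v\n", time.Since(start))
}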

Usage and elapsed time

(Screenshots of example runs: String Match, Regex Match, Index Match.)

Analysis

Benchmark test

  • Benchmark analysis:

goos: darwin
goarch: amd64
pkg: github.com/stashedup/target-case-study-doc-search
cpu: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
BenchmarkTestStringMatch_1_term-8    	      16	  67688474 ns/op
BenchmarkTestRegexMatch_1_term-8     	     174	   6945584 ns/op
BenchmarkTestIndexer_1_term-8        	  439076	      2553 ns/op

From my benchmark testing, it appears that the indexed search is the fastest. That is because the index is built before the search runs, so a lookup is O(1), whereas the string and regex searches are O(n) in the size of the content.
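
For reference, the numbers above come from Go's built-in benchmarking (go test -bench). The actual benchmark file is not reproduced here; a benchmark in that style looks roughly like the sketch below, where the search term is a placeholder.

// Illustrative benchmark in the style of the results above: each iteration
// runs one string-match search over the preloaded file contents.
func BenchmarkTestStringMatch_1_term(b *testing.B) {
	files := ReadDir(ReadConfig())
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for _, f := range files {
			StringMatch(f.Lines, "penguin")
		}
	}
}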

Loading documents

We first need a function that reads all the documents dropped into a particular directory.
A config file tells the program where the files are stored: the default location is ./files, unless a file called config.yaml overrides it.

func ReadConfig() string {

	var directory = "./files"
	// Set the file name of the configurations file
	viper.SetConfigName("config")

	// Set the path to look for the configurations file
	viper.AddConfigPath(".")

	viper.SetConfigType("yaml")

	if err := viper.ReadInConfig(); err != nil {
		if _, ok := err.(viper.ConfigFileNotFoundError); ok {
			// Config file not found; ignore error if desired
			fmt.Println("config.yml not found")
			return directory
		} else {
			// Config file was found but another error was produced
		}
	}

	fmt.Printf("Files directory set to: %s\n", viper.GetString("directory"))
	directory = viper.GetString("directory")

	return directory
}

This is what the config.yaml file should look like:

directory: ./files

Next, we read the content of each file:

func ReadDir(dirname string) []FileContent {
	files, err := ioutil.ReadDir(dirname)
	if err != nil {
		log.Fatal(err)
	}
	var filesContent []FileContent
	for _, file := range files {
		if !file.IsDir() { //if not a directory
			var content FileContent
			lines, err := readLines(filepath.Join(dirname, file.Name()))
			if err != nil {
				fmt.Printf("Error reading %s: %v. Continuing with remaining files...\n", file.Name(), err)
				continue
			}
			content.Name = file.Name()
			content.Lines = lines
			filesContent = append(filesContent, content)
		}
	}
	return filesContent
}

func readLines(filepath string) (lines []string, err error) {

	var (
		file   *os.File
		part   []byte
		prefix bool
	)

	file, err = os.Open(filepath)
	if err != nil {
		return
	}
	defer file.Close()

	reader := bufio.NewReader(file)
	buffer := bytes.NewBuffer(make([]byte, 0))
	for {

		if part, prefix, err = reader.ReadLine(); err != nil {
			// fmt.Println(err)
			break
		}
		buffer.Write(part)
		if !prefix {
			//sliceData := strings.Split(buffer.String(), "\n")
			lines = append(lines, buffer.String())
			buffer.Reset()
		}
	}
	if err == io.EOF {
		err = nil
	}
	return
}

We will use this structure for each file:

type FileContent struct {
	Name  string //file name (in production this would be a document ID)
	Lines []string //all the lines in the file
	Count int //frequency of a particular word
}
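
As a small usage sketch (not part of the original snippets), the loaders above can be wired together like this:

// loadAll reads the configured directory and reports how many lines were
// loaded from each file.
func loadAll() []FileContent {
	directory := ReadConfig()
	files := ReadDir(directory)
	for _, f := range files {
		fmt.Printf("%s: %d lines\n", f.Name, len(f.Lines))
	}
	return files
}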

Simple string matching

Simple string matching compares each word in each line (sentence) against the search term.

//StringMatch engine: uses strings.EqualFold to compare each word against the term
func StringMatch(docs []string, term string) (count int) {

	for _, doc := range docs {
		sentence := analyze(doc)
		for _, word := range sentence {
			if strings.EqualFold(word, term) {
				count++
			}
		}
	}
	return
}

Regular Expression matching

This approach matches a regular expression against the entire line (sentence) rather than against individual words, as simple string matching does.

  • (?i) makes the regex case-insensitive

  • \b matches a word boundary (a position where one side is a word character and the other is not)

//RegularExpression engine : uses MustCompile to match string in each line
func RegularExpression(docs []string, term string) int {

	re := regexp.MustCompile(`(?i)\b` + term + `\b`) // Don't do this in production, it's a security risk. term needs to be sanitized.
	var matches [][]int
	count := 0
	for _, doc := range docs {
		matches = re.FindAllStringIndex(doc, -1)
		count += len(matches)
	}
	return count
}
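
One way to address the injection risk called out in the comment above is to escape the user-supplied term before building the pattern, for example with regexp.QuoteMeta from the standard library. A sketch (not part of the original code):

// safeRegex builds the same case-insensitive, whole-word pattern, but
// escapes any regex metacharacters in the raw user term and returns an
// error instead of panicking on a bad pattern.
func safeRegex(term string) (*regexp.Regexp, error) {
	return regexp.Compile(`(?i)\b` + regexp.QuoteMeta(term) + `\b`)
}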

Pre-populated Index matching

When the program starts, it runs the init() function, which reads all the documents and indexes them. This takes time upfront (an eager approach), but makes later queries much faster. Both regular expression matching and simple string matching are, by comparison, lazy approaches: all the work happens at query time.
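
The index type itself is not shown in these snippets. From the way init() and Indexer use it below, it is presumably an inverted index mapping each token to a per-document count, roughly:

// Inferred shape of the index (the original declaration is not shown):
// token -> document name -> number of occurrences.
type index map[string]map[string]int

// Id is the package-level index populated by init().
var Id index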

  • Initialize the storage folder to be read

func init() {
	directory := ReadConfig()

	fmt.Println("Initialize index...")
	idx := make(index)
	filecontents := ReadDir(directory)
	for _, val := range filecontents {
		idx.Indexer(val.Lines, val.Name)
	}
	Id = idx
}
  • Indexing

func (idx index) Indexer(docs []string, ID string) {
	for _, doc := range docs {
		for _, token := range analyze(doc) {
			if properties, tokenExist := idx[token]; tokenExist {
				if count, documentSeen := properties[ID]; documentSeen {
					count++
					properties[ID] = count
				} else {
					properties[ID] = 1
				}
			} else {
				holder := make(map[string]int)
				holder[ID] = 1
				idx[token] = holder
			}
		}
	}
}
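
The lookup side of the index is then a single map access. The original lookup function is not shown above; the following is a minimal sketch, assuming the search term is normalized with the same analyze pipeline used at index time:

// IndexedMatch is a hypothetical lookup helper (not from the original code):
// it normalizes the term the same way documents were indexed and returns how
// often it occurs in the given document. Missing tokens or documents yield 0.
func IndexedMatch(term, docID string) int {
	tokens := analyze(term)
	if len(tokens) == 0 {
		return 0
	}
	return Id[tokens[0]][docID]
}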

Relevant functions

  • Tokenizer
  • converts text into a list of tokens: we split the text on word boundaries and drop punctuation marks. For example, tokenize("Small wild-cat!") returns ["Small", "wild", "cat"].

//Tokenizer
func tokenize(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		// Split on any character that is not a letter or a number.
		return !unicode.IsLetter(r) && !unicode.IsNumber(r)
	})
}
  • Filters (Normalization of text)
  • Lowercase
  • pengUin => penguin

func lowercaseFilter(tokens []string) []string {
	r := make([]string, len(tokens))
	for i, token := range tokens {
		r[i] = strings.ToLower(token)
	}
	return r
}
  • Drop common words

//words that are common - we can ignore
var commonwords = map[string]struct{}{
	"a": {}, "and": {}, "be": {}, "have": {}, "i": {},
	"in": {}, "of": {}, "that": {}, "the": {}, "to": {},
}

func commonwordFilter(tokens []string) []string {
	r := make([]string, 0, len(tokens))
	for _, token := range tokens {
		if _, ok := commonwords[token]; !ok {
			r = append(r, token)
		}
	}
	return r
}
  • Stemming
  • washing, washed, and washer may all be reduced to the base form (stem) wash.
  • Library: the code uses a Snowball stemming package (imported as snowballeng).

func stemmerFilter(tokens []string) []string {
	r := make([]string, len(tokens))
	for i, token := range tokens {
		r[i] = snowballeng.Stem(token, false)
	}
	return r
}
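
These filters are presumably composed into the analyze function that StringMatch and Indexer call. Its definition is not shown in the snippets above, but it would look something like:

// analyze runs the full normalization pipeline (sketch of the function the
// other snippets call): tokenize, lowercase, drop common words, then stem.
func analyze(text string) []string {
	tokens := tokenize(text)
	tokens = lowercaseFilter(tokens)
	tokens = commonwordFilter(tokens)
	tokens = stemmerFilter(tokens)
	return tokens
}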

How do we scale?

Systems Design

  • Users can upload a file
  • Users can search for files based on frequency of the search term

Requirement and goal of the system

  • The average file size is 200KB
  • There could be 10K uploads a day
  • There could be 2M searches a day
  • Services would need to process 5000 requests/second at peak (2M searches/day averages only about 23 searches/second, so this target reflects peak load)

Capacity Estimation and Constraints

  • Storage capacity: 10000 x 200KB = 2GB/day

High-level Design

  • At a high level, we need to store all the documents in block storage and also build an index that keeps track of which word appears in which document.

(High Level Design diagram)

Detailed Component Design

  • Storage
    • 2GB x 365 days x 5 years ≈ 3.65TB
  • How do we create system-wide unique Document IDs?
    • 10K new files each day
    • 10K x 365 days x 5 years ≈ 18.25M documents
    • Let's assume we have a service that can generate a unique Document ID whenever we need to store a document. We can feed the Document ID to our hash function to find the storage server and store the document there.
  • We can partition our data by sharding on the Document ID (see the sketch after this list):
    • When storing a document, we pass its Document ID to our hash function to find the server, and index all of that document's words on that server.
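
As a concrete illustration of the hash-based placement described above (a sketch; the hash choice and shard count are placeholders, and it needs the hash/fnv import):

// shardFor picks a storage/index shard for a Document ID by hashing the ID
// and taking it modulo the number of shards. FNV-1a is used here purely as
// an example of a stable hash; any consistent hashing scheme would work.
func shardFor(documentID string, numShards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(documentID))
	return h.Sum32() % numShards
}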


Workflow

  • Upload Process
    • User uploads a document, or multiple documents
    • Documents are queued to be processed
    • the Index Build Servers will:
      • assign a unique Document ID to the document
      • index the words and update the Index servers with the new values
      • send the document to a storage server chosen based on the Document ID (the shard key)
  • Search Process
    • The user sends a search term to the Search Service
    • The Search Service asks the Index servers for the most relevant documents (their Document IDs)
    • Using the Document ID (and the same hash function), the Search Service retrieves each document from its storage server

Low level Index Structure

(Low level Index Structure diagram)

Fault Tolerance

  • What will happen when an index server dies?
  • We can keep a secondary replica of each index server; if the primary dies, the secondary takes over after failover
  • Both primary and secondary hold the same copy of the index

Cache

  • To deal with hot search terms, we can introduce a cache in front of the Index servers

Load Balancing

  1. Between Searching users and Search Services
  2. Between Upload users and Upload Services

Conclusion

In conclusion, we have determined that using a prebuilt index is the fastest way to search for a term across all the documents. While this code works as a local proof of concept, we would have to decouple some of these functions into separate services for a large-scale system.

  • Hardware considerations

    • We need machines with significant memory, since the index will be held in RAM
    • Index Builder Servers – r6g.8xlarge – $95.71/instance (minimum 3 in production)
    • Services (kubernetes cluster) – t4g.large – $33.73/instance (minimum 3 for each service in production)
    • cache servers – db.r6g.xlarge – $450.41/month (minimum 1)
    • ALBs – minimum $20/month
  • Software considerations

    • Depending on where we end up with the index-build application, we might consider third-party options such as Elasticsearch or Kibana
    • For the services:
      • frontend – JavaScript (no preference)
      • backend – Go – I picked Go because I have experience developing in it, and its concurrency features can help speed things up

GitHub

https://github.com/stashedup/target-case-study-doc-search