Auto-indexing inference configuration reference
This document details how a site administrator can supply a Lua script to customize the way Sourcegraph detects precise code intelligence indexing jobs from repository contents.
By default, Sourcegraph will attempt to infer index jobs for the following languages:
- Go
- Java/- Scala/- Kotlin
- Python
- Ruby
- Rust
- TypeScript/- JavaScript
Inference logic can be disabled or altered in the case when the target repositories do not conform to a pattern that the Sourcegraph default inference logic recognizes. Inference logic is controlled by a Lua override script that can be supplied in the UI under Admin > Code graph > Inference.
Example
The Lua override script ultimately must return an auto-indexing config object. A configuration that neither disables or adds new recognizers does not change the default inference behavior.
LUAreturn require("sg.autoindex.config").new({ -- Empty configuration (see below for usage) })
To disable default behaviors, you can re-assign a recognizer value to false. Each of the built-in recognizers are prefixed with sg. (and are the only ones allowed to be).
LUAreturn require("sg.autoindex.config").new({ -- Disable default Python inference ["sg.python"] = false })
To add additional behaviors, you can create and register a new recognizer. A recognizer is an interface that requests some set of files from a repository, and returns a set of auto-indexing job configurations that could produce a precise code intelligence index.
A path recognizer is a concrete recognizer that advertises a set of path globs it is interested in, then invokes its generate function with matching paths from a repository. In the following, all files matching Snek.module (Snek.module, proj/Snek.module, proj/sub/Snek.module, etc) are passed to a call to generate (if non-empty). The generate function will then return a list of indexing job descriptions. The guide for auto-indexing jobs configuration gives detailed descriptions on the fields of this object.
The ordering of paths and limits are defined in the Ordering guarantees and limits section.
LUAlocal path = require("path") local pattern = require("sg.autoindex.patterns") local recognizer = require("sg.autoindex.recognizer") local snek_recognizer = recognizer.new_path_recognizer { patterns = { -- Look for Snek.module files -- (would match Snek.module; proj/Snek.module, proj/sub/Snek.module, etc) pattern.new_path_basename("Snek.module"), -- Ignore any files in test or vendor directories pattern.new_path_exclude( pattern.new_path_segment("test"), pattern.new_path_segment("vendor") ), }, -- Called with list of matching Snek.module files generate = function(_, paths) local jobs = {} for i = 1, #paths do -- Create indexing job description for each matching file table.insert(jobs, { indexer = "acme/snek:latest", -- Run this indexer... root = path.dirname(paths[i]), -- ...in this directory local_steps = {"snekpm install"}, -- Install dependencies indexer_args = {"snek", "index", ".", "--output", "index.scip"}, outfile = "index.scip", }) end return jobs end } return require("sg.autoindex.config").new({ -- Register new recognizer ["acme.snek"] = snek_recognizer, })
Available libraries
There are a number of specific and general-purpose Lua libraries made accessible via the built-in require.
The type signatures for the functions below use the following syntax:
- (A1, ..., An) -> R: Function type with arguments of type- A1, ..., Anand return type- R.
- array[A]: Table with indexes 1 to N of elements of type- A.
- table[K, V]: Table with keys of type- Kand values of type- V.
- A | B: Union type (includes values of type- Aand type- B).
- A...: Variadic (0 or more values of A, without being wrapped in a table).
- "mystring": Literal string type with only- "mystring"as the allowed value.
- {K1: V1, K2: V2, ...}: Heterogenous table (object) with a key of type- K1mapping to a value of type- V1etc.
- void: no value returned from function
sg.autoindex.recognizer
This auto-indexing-specific library defines the following two functions.
- 
new_path_recognizercreates aRecognizerfrom a config object containingpatternsandgeneratefields. See the example above for basic usage.- Type:
whereSHELL({ -- List of patterns to match against paths in the repository "patterns": array[pattern], -- List of patterns to match against paths in the repository -- for getting contents (see contents_by_path below) "patterns_for_content": array[pattern], -- Callback function invoked with paths requested by patterns above -- for creating index jobs "generate": ( registration_api, -- List of paths obtained from 'patterns' and -- 'patterns_for_content' combined. paths: array[string], -- Table mapping paths to contents for paths matched by -- 'patterns_for_content' contents_by_path: table[string, string] ) -> array[index_job], }) -> recognizerindex_jobis an object with the following shape:For installing dependencies, if the indexer image contains the relevant package manager(s), then it is simpler to install dependencies usingSHELLindex_job = { -- Docker image for the indexer "indexer": string, -- Working directory for invoking the indexer "root": string, -- Preparatory steps to run before invoking the indexer -- such as installing dependencies "steps": array[{ -- Working directory for this step "root": string, -- Docker image to use for this step "image": string, -- List of commands to run inside the Docker image "commands": array[string] }], -- List of commands to run inside the indexer image at "root" -- before invoking the indexer, such as installing dependencies. "local_steps": array[string], -- Command-line invocation for the indexer "indexer_args": array[string], -- Path to the index generated by the indexer "outfile": string, -- Names of necessary environment variables. These are -- made accessible to steps, local_steps, and the -- indexer_args command. -- -- These are generally used for passing secrets. "requested_envvars": array[string], }local_steps. Otherwise, thestepsfield allows more customizability.
 
- Type:
- 
new_fallback_recognizercreates arecognizerfrom an ordered list ofrecognizers. Eachrecognizeris called sequentially, until one of them emits non-empty results.- Type: (array[recognizer]) -> recognizer
 
- Type: 
The registration_api object has the following API:
- registerwhich queues a- recognizerto be run at a later stage. This makes it possible to add more recognizers dynamically, such as based on whether specific configuration files were found or not.- Type: (recognizer) -> void
 
- Type: 
sg.autoindex.patterns
This auto-indexing-specific library defines the following four path pattern constructors.
- new_path_literal(fullpath)creates a- patternthat matches an exact filepath.- Type: (string) -> pattern
 
- Type: 
- new_path_segment(segment)creates a- patternthat matches a directory name.- Type: (string) -> pattern
 
- Type: 
- new_path_basename(basename)creates a- patternthat matches a basename exactly.- Type: (string) -> pattern
 
- Type: 
- new_path_extension(ext_no_leading_dot)creates a- patternthat matches files with a given extension.- Type: (string) -> pattern
 
- Type: 
This library also defines the following two pattern collection constructors.
- new_path_combine(patterns)creates a pattern collection object (to be used with recognizers) from the given set of path- patterns.- Type: ((pattern | array[pattern])...) -> pattern
 
- Type: 
- new_path_exclude(patterns)creates a new inverted pattern collection object. Paths matching these- patterns are filtered out from the set of matching filepaths given to a recognizer's- generatefunction.- Type: ((pattern | array[pattern])...) -> pattern
 
- Type: 
path
This library defines the following utility functions:
- ancestors(path)returns a list- {dirname(path), dirname(dirname(path)), ...}. The last element in the list will be an empty string.- Type: (string) -> array[string]
 
- Type: 
- basename(path)returns the basename of the given path as defined by Go's filepath.Base.- Type: (string) -> string
 
- Type: 
- dirname(path)returns the dirname of the given path as defined by Go's filepath.Dir, except that it (1) returns an empty path instead of- "."if the path is empty and (2) removes a leading- /if present.- Type: string -> string
 
- Type: 
- join(path1, path2)returns a filepath created by joining the given path segments via filepath separator.- Type: (string, string) -> string
 
- Type: 
- split(path)is a convenience function that returns- dirname(path), basename(path).- Type: (string) -> string, string
 
- Type: 
json
This library defines the following two JSON utility functions:
- encode(val)returns a JSON-ified version of the given Lua object.
- decode(json)returns a Lua table representation of the given JSON text.
fun
Lua Functional is a high-performance functional programming library accessible via local fun = require("fun"). This library has a number of functional utilities to help make recognizer code a bit more expressive.
Ordering guarantees and limits
Sourcegraph enforces several limits to avoid inference timeouts and ever-growing auto-indexing queues. These limits apply for a single round of inference for a single repository, combined across all recognizers, including any implicitly included Sourcegraph recognizers.
| Limit | Default value | 
|---|---|
| The number of auto-indexing jobs inferred | 100 | 
| The number of total paths passed to the inference script's generatefunctions as the second argumentpaths | 500 | 
| The number of total paths with contents passed to the inference script's generatefunctions as the third argumentcontents_by_paths | 100 | 
| Maximum size limit for file contents, in bytes | 1 MiB | 
Auto-indexing jobs and paths are first ranked based on the criteria described below. If the number of jobs and/or paths exceeds the limits above, lower ranked items are discarded.
- 
For auto-indexing jobs, ranking is done based on the following: - Descending order of indexer frequency (total number of inferred jobs with the same indexerfield).
- Ascending lexicographic ordering of indexer.
- Descending order of number of path components for root. Shallower roots are preferrred over deeper ones as they are more likely to cover more code.
- Ascending lexicographic ordering of rootpaths.
 
- Descending order of indexer frequency (total number of inferred jobs with the same 
- 
For paths, ranking happens in the following order: - Paths for which the contents are requested are ranked higher.
- Paths with fewer components are ranked higher.
- Otherwise, lexicographic ordering of paths is used.