
Implement similarity search in Sourcify #2455

@marcocastignoli

Description


Blockscout's similarity search

In the Ethereum Bytecode Database Microservice blog post, the only "filter" applied to the bytecode is the split between the execution and auxdata parts. They simply split the two and use only the execution part of the bytecode to find already existing contracts. No other type of transformation is included in the similarity search.

Sourcify solution

We can:

  • take the length and the first 75 bytes of the candidate (we'll need an index on the first 75 bytes; see the sketch after this list)
  • filter our database by equal byte length AND the same first 75 bytes
  • extract the results and generate the offsets on the fly from the compilation artifacts (via js)
  • replace the offsets in the candidate and check whether it matches any of the results
  • if it matches, we can consider it a match (an exact_match if the auxdata is also equal) and store it in the database without going through the standard verification flow.
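
A minimal sketch of the index and prefilter, assuming a hypothetical contracts table with a bytecode bytea column (all names here are illustrative):

-- Hypothetical expression index covering the two prefilter conditions;
-- substr on bytea is 1-based and returns bytea.
CREATE INDEX idx_contracts_len_prefix
  ON contracts (length(bytecode), substr(bytecode, 1, 75));

-- Prefilter: rows with the same total byte length and the same
-- 75-byte prefix as the candidate ($1).
SELECT id
FROM contracts
WHERE length(bytecode) = length($1::bytea)
  AND substr(bytecode, 1, 75) = substr($1::bytea, 1, 75);

If the 75-byte prefix is selective enough, this should keep the set of rows that reach the range-based comparison small.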

Doubt:

How do we know the transformations for the candidate? If this doubt turns out to be valid, then we need to call the verification function, but only on the results that match.

I also include Marco's proposal here, from which we got the idea of using the artifacts' offsets.

This is a simplified version of the table (for explanatory purposes); see the Improvements section at the bottom.

  • candidate: the raw runtime bytecode for which we want to find similar contracts.
  • normalized bytecodes: the available options to be matched with the candidate

We can maintain an additional table in which we use the compilation artifacts + auxdata positions to store the positions of the items listed below in a diff_offsets column. This table will contain only runtime bytecodes. A sketch of a possible DDL follows the list.

contract_id | normalized_bytecode (bytea) | diff_offsets (int4multirange)
----------- | --------------------------- | -----------------------------------
1           | 123100000012ff0000000044    | '{[4,10), [14,22)}'::int4multirange
  • Immutable Variables (only runtime code)
  • Libraries
  • CBOR Auxdata
  • Call Protection (only runtime code)
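
A minimal sketch of this table's DDL, reusing the names the query below assumes (bytecode_norm, nbytes, normalized_bytecode, diff_offsets); the exact schema is an assumption:

-- Hypothetical schema; int4multirange requires PostgreSQL 14+.
CREATE TABLE bytecode_norm (
  id                  BIGINT PRIMARY KEY,  -- the contract this row belongs to
  normalized_bytecode BYTEA NOT NULL,      -- runtime bytecode with masked spans zero-filled (as in the example above)
  nbytes              INTEGER NOT NULL,    -- precomputed length(normalized_bytecode)
  diff_offsets        INT4MULTIRANGE       -- masked byte spans, 0-based, [lower, upper)
);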

Finally, we can implement a query like the one below, which uses the diff_offsets column to normalize the searched NON-normalized bytecode on the fly:

WITH cand AS (
  SELECT
    $1::bytea AS b,                         -- $1 is the candidate bytecode parameter, cast to bytea
    length($1::bytea) AS n                  -- length(bytea): number of bytes in the candidate bytecode
)
SELECT
  b.id,
  b.normalized_bytecode,
  b.diff_offsets
FROM bytecode_norm AS b
JOIN cand c
  ON c.n = b.nbytes                         -- prefilter by total byte length
WHERE NOT EXISTS (                           -- a row matches iff NO kept segment differs
-- Expand the kept ranges (the complement of diff_offsets within [0, nbytes))
-- into rows, then check each kept segment for a mismatch between the
-- candidate and the stored normalized bytecode.
  SELECT 1
  FROM unnest(                               -- unnest(anymultirange): expands a multirange into its constituent ranges
         (multirange(                        -- multirange(range): wraps a single range into a multirange
            int4range(0, b.nbytes)          -- int4range(lower, upper): constructs [lower, upper) over int4 (here [0, nbytes))
          )
          - COALESCE(
              b.diff_offsets,                --   diff_offsets is an int4multirange of masked (ignored) byte spans
              '{}'::int4multirange          --   empty multirange when there are no diffs
            )
        )
       ) AS r
  WHERE substr(                               -- substr(bytea, start, length): slice of BYTEA; start is 1-based
          c.b,
          lower(r) + 1,                      -- lower(range): lower bound of range r (0-based); +1 to convert to 1-based for substr
          upper(r) - lower(r)                -- upper(range): upper bound of r; (upper - lower) is the segment length
        )
     <> substr(                               -- compare the same kept slice of the stored normalized bytecode
          b.normalized_bytecode,
          lower(r) + 1,
          upper(r) - lower(r)
        )
);

Improvements

  • we can store the sha256 of the bytecode instead of the full bytecode
    • the query will change, but I think it's possible: we replace all the masked ranges in the candidate, hash it, then compare it with the normalized hash (see the sketch after this list).
  • we can skip storing the normalized bytecode and use the one in contracts (since it's linked)
  • also support creation bytecode, but we need to somehow handle constructor arguments
    • potentially we can omit constr_args in the normalized bytecode, and in the filter we can use ON c.n >= b.nbytes
    • also, we can add a column marking the bytecode as runtime or creation to filter out more normalized_bytecodes.
  • before trying to filter via ranges, we can check whether the candidate perfectly matches an existing normalized contract (excluding the auxdata part):
    • add an additional column execution_bytecode_hash in the candidates table, i.e. the hash of the bytecode
    • before replacing all the artifacts' offsets, replace only the auxdata (using the candidate's auxdata offset) and compare the result with execution_bytecode_hash
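
A hedged sketch of the sha256 variant from the first bullet, assuming masked spans are zero-filled during normalization (as in the example table) and a hypothetical normalized_bytecode_sha256 column; sha256(bytea) is built into PostgreSQL 11+:

-- Rebuild the candidate with every masked range zeroed, hash it, and
-- compare the hash against the stored one.
WITH cand AS (
  SELECT $1::bytea AS b, length($1::bytea) AS n
)
SELECT b.id
FROM bytecode_norm AS b
JOIN cand c ON c.n = b.nbytes
WHERE b.normalized_bytecode_sha256 = (
  SELECT sha256(string_agg(seg, ''::bytea ORDER BY pos))
  FROM (
    -- kept segments, copied verbatim from the candidate
    SELECT lower(r) AS pos,
           substr(c.b, lower(r) + 1, upper(r) - lower(r)) AS seg
    FROM unnest(multirange(int4range(0, b.nbytes))
                - COALESCE(b.diff_offsets, '{}'::int4multirange)) AS r
    UNION ALL
    -- masked segments, replaced by zero bytes
    SELECT lower(r),
           decode(repeat('00', upper(r) - lower(r)), 'hex')
    FROM unnest(COALESCE(b.diff_offsets, '{}'::int4multirange)) AS r
  ) AS parts(pos, seg)
);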

TODO

  • Check how many contracts are returned after filtering by same length and first 75 bytes
  • Add an index on the first 75 bytes (staging + prod) and measure worst-case and typical query performance:
    • With same length
    • Without same length
  • Run new Verify(preRunCompilation, SourcifyChain, address).verify() 300 times and measure performance.
  • Decide whether we should also implement a metadata hash check against the DB to skip compilation
