Save indexed data to local CSV files

Objective

This tutorial describes how to use the Subsquid indexing framework for saving processed blockchain data to local CSV files. The intent is to show how Subsquid SDK can be used for data analytics prototyping.

File-based data formats like CSV are convenient for data analysis, especially in the early prototyping stages. This convenience motivated Subsquid Team to develop some extensions to allow saving processed data to file-based storage.

We chose MATIC transactions on Ethereum mainnet for this example. This selection provides enough data to highlight the performance of the Subsquid framework and is interesting from a data analysis standpoint.

An article about this demo project has been published on Medium. The project source code can be found in the local-csv-indexing repository.

warning

As of 2023-12-17, the local-csv-indexing repo is mostly still sound, but already somewhat outdated. You can take a look at the less sophisticated, yet regularly updated example here.

Pre-requisites

Subsquid CLI
(optional) Python

Setup

Let's start by creating a new blockchain indexer, or "squid" in Subsquid terminology. In a terminal, launch this command:

sqd init local-csv-indexing -t evm

Here, local-csv-indexing is the name of the project, and can be changed to anything else. The -t evm option specifies that the evm template should be used as a starting point.

info

Note: The template actually has more than what we need for this project. Unnecessary packages have been removed in the tutorial repository. You can grab package.json from there to do the same.

Files-wise, docker-compose.yml, schema.graphql and squid.yaml were removed. commands.json, the list of local sqd scripts, has been significantly shortened (here is the updated version).

ERC-20 token ABI

To be able to index transfers of a token, it's necessary to know the address and the ABI (Application Binary Interface) of the token contract. The ABI defines the contract's functions and events, including their typed inputs and outputs.

Luckily, both of these can be found on block explorers like Etherscan.

The indexer needs the ABI for locating the contract events or functions in the EVM execution trace and decoding their inputs. Subsquid SDK has a handy command to generate some boilerplate TypeScript code to achieve this:

npx squid-evm-typegen src/abi 0x7D1AfA7B718fb893dB30A3aBc0Cfc608AaCfeBB0#matic

info

The typegen tool uses Etherscan API to fetch the contract ABI. Other compatible APIs are supported via the --etherscan-api flag. For example, if the contract was deployed to Polygon and its ABI was available from Polygonscan, it could still be fetched with the same command extended with --etherscan-api https://api.polygonscan.io.

Alternatilvely, the same command can be used with a path (local or URL) to a JSON ABI in place of the contract address.

This will generate some files under the src/abi folder, the most interesting of which is matic.ts.

CSV, Tables and Databases

For writing local CSVs we will need the file-store-csv package:

npm i @subsquid/file-store-csv

info

Packages are also available for writing to parquet files and for uploading to S3-compatible cloud services.

Next, let's create a new file at src/tables.ts. This is where it's possible to provide filenames for the CSV files, as well as configure their data structure, in much the same way as if they were a database table (the class name is no coincidence):

import {Table, Column, Types} from '@subsquid/file-store-csv'

export const Transfers = new Table(
  'transfers.csv',
  {
    blockNumber: Column(Types.Numeric()),
    timestamp: Column(Types.DateTime()),
    contractAddress: Column(Types.String()),
    from: Column(Types.String()),
    to: Column(Types.String()),
    amount: Column(Types.Numeric()),
  },
  {
    header: false,
  }
)

Let's create another file next, this time named src/db.ts, to configure the data abstraction layer. Here we export an instance of the Database class implementation from the file-store package (a dependency of file-store-csv). We will use this instance in much the same way as we would use a TypeormDatabase instance in a PostgreSQL-based squid.

import {Database, LocalDest, Store} from '@subsquid/file-store'
import { Transfers } from './tables'

export const db = new Database({
  tables: {
    Transfers,
  },
  dest: new LocalDest('./data'),
  chunkSizeMb: 100
})

info

Note the chunkSizeMb option. A new chunk (that is, a new folder with a new CSV file in it) will be written when either the amount of data stored in the processor buffer exceeds chunkSizeMb, or at the end of the batch during which ctx.store.setForceFlush() was called.

Data indexing

All the indexing logic is defined in src/processor.ts, so let's open it and edit the EvmBatchProcessor class configuration. We should request data for the right smart contract and the right EVM event log:

export const contractAddress =
  '0x7D1AfA7B718fb893dB30A3aBc0Cfc608AaCfeBB0'.toLowerCase();

const processor = new EvmBatchProcessor()
  .setGateway('https://v2.archive.subsquid.io/network/ethereum-mainnet')
  .setRpcEndpoint({
    url: process.env.RPC_ENDPOINT,
    rateLimit: 10
  })
  .setFinalityConfirmation(75)
  .addLog({
    address: [contractAddress],
    topic0: [events.Transfer.topic]
  });

warning

Note: the RPC_ENDPOINT environment variable is used, so make sure to edit the .env file and use a valid URL of an Ethereum RPC node, e.g.:

DB_NAME=squid
DB_PORT=23798
GQL_PORT=4350
# JSON-RPC node endpoint, both wss and https endpoints are accepted
RPC_ENDPOINT="<eth_rpc_endpoint_url>"

Let's define the logic to process a batch of EVM log data, and save it to CSV files.

A double loop is necessary to explore all the blocks in each batch and the logs in each block. The processor provides no guarantees that it won't provide any data not matching the filters, so it's necessary to check that the logs have been generated by the right address and that they have the right topic signature.

It's then possible to decode the events and prepare objects with the right data structure, which are then written to the Transfers CSV file. Here is a short summary of the logic:

processor.run(db, async (ctx) => {
  let decimals = 18
  for (let block of ctx.blocks) {
    for (let log of block.logs) {
      if (log.address !== contractAddress) continue;
      if (log.topic0 !== events.Transfer.topic) continue;

      const { from, to, value } = events.Transfer.decode(log);

      ctx.store.Transfers.write(({
        blockNumber: block.header.height,
        timestamp: new Date(block.header.timestamp),
        contractAddress: log.address,
        from: from.toLowerCase(),
        to: to.toLowerCase(),
        amount: BigDecimal(value.toBigInt(), decimals).toNumber(),
      });
    }
  }
});

info

The file in the GitHub repository is slightly different, as there's some added logic to obtain the number of decimals for the token. For that, the processor interacts with the smart contract deployed on chain.

Launch the project

Build, then launch the project with

npm run build

node -r dotenv/config lib/main.js

And in a few minutes, a few sub-folders (whose names are the block ranges where the data is coming from) should be created under the data directory, each containing a transfer.csv file. multiple folders containing CSV files

Data analysis with Python

If you want to learn how to analyze this data using Python and Pandas, refer to the Medium article dedicated to this demo project, or consult the project's repository on GitHub.

Conclusions

The purpose of this project was to demonstrate how to use Subsquid's indexing framework for data analytics prototyping: the indexer was able to ingest all this data, process it, and dump it to local CSV files in roughly 20 minutes (figure out of date). The simple Python script in the project's repository shows how to read multiple CSV files, and perform some data analysis with Pandas.

Subsquid Team seeks feedback on this new tool. If you want to share any thoughts or have any suggestions, feel free to reach out to us at the SquidDevs Telegram channel.

Save indexed data to local CSV files

Objective​

Pre-requisites​

Setup​

ERC-20 token ABI​

CSV, Tables and Databases​

Data indexing​

Launch the project​

Data analysis with Python​

Conclusions​

Objective

Pre-requisites

Setup

ERC-20 token ABI

CSV, Tables and Databases

Data indexing

Launch the project

Data analysis with Python

Conclusions