Install Ethereum ETL:
> pip3 install ethereum-etl
Export blocks and transactions (Schema, Reference):
> ethereumetl export_blocks_and_transactions --start-block 0 --end-block 500000 \
--provider-uri https://mainnet.infura.io --blocks-output blocks.csv --transactions-output transactions.csv
Export ERC20 and ERC721 transfers (Schema, Reference):
> ethereumetl export_token_transfers --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --output token_transfers.csv
Export traces (Schema, Reference):
> ethereumetl export_traces --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/parity.ipc --output traces.csv
Stream blocks, transactions, logs, token_transfers continually to console (Reference):
> pip3 install ethereum-etl[streaming]
> ethereumetl stream --start-block 500000 -e block,transaction,log,token_transfer --log-file log.txt
Find other commands here.
For the latest version, check out the repo and call
> pip3 install -e .
> python3 ethereumetl.py
- Schema
- Exporting the Blockchain
- Ethereum Classic Support
- Querying in Amazon Athena
- Querying in Google BigQuery
- Querying in Kaggle
- Blockchain ETL in Media
blocks.csv
| Column | Type |
|---|---|
| number | bigint |
| hash | hex_string |
| parent_hash | hex_string |
| nonce | hex_string |
| sha3_uncles | hex_string |
| logs_bloom | hex_string |
| transactions_root | hex_string |
| state_root | hex_string |
| receipts_root | hex_string |
| miner | address |
| difficulty | numeric |
| total_difficulty | numeric |
| size | bigint |
| extra_data | hex_string |
| gas_limit | bigint |
| gas_used | bigint |
| timestamp | bigint |
| transaction_count | bigint |
transactions.csv
| Column | Type |
|---|---|
| hash | hex_string |
| nonce | bigint |
| block_hash | hex_string |
| block_number | bigint |
| transaction_index | bigint |
| from_address | address |
| to_address | address |
| value | numeric |
| gas | bigint |
| gas_price | bigint |
| input | hex_string |
| block_timestamp | bigint |
token_transfers.csv
| Column | Type |
|---|---|
| token_address | address |
| from_address | address |
| to_address | address |
| value | numeric |
| transaction_hash | hex_string |
| log_index | bigint |
| block_number | bigint |
receipts.csv
| Column | Type |
|---|---|
| transaction_hash | hex_string |
| transaction_index | bigint |
| block_hash | hex_string |
| block_number | bigint |
| cumulative_gas_used | bigint |
| gas_used | bigint |
| contract_address | address |
| root | hex_string |
| status | bigint |
logs.csv
| Column | Type |
|---|---|
| log_index | bigint |
| transaction_hash | hex_string |
| transaction_index | bigint |
| block_hash | hex_string |
| block_number | bigint |
| address | address |
| data | hex_string |
| topics | string |
contracts.csv
| Column | Type |
|---|---|
| address | address |
| bytecode | hex_string |
| function_sighashes | string |
| is_erc20 | boolean |
| is_erc721 | boolean |
| block_number | bigint |
tokens.csv
| Column | Type |
|---|---|
| address | address |
| symbol | string |
| name | string |
| decimals | bigint |
| total_supply | numeric |
traces.csv
| Column | Type |
|---|---|
| block_number | bigint |
| transaction_hash | hex_string |
| transaction_index | bigint |
| from_address | address |
| to_address | address |
| value | numeric |
| input | hex_string |
| output | hex_string |
| trace_type | string |
| call_type | string |
| reward_type | string |
| gas | bigint |
| gas_used | bigint |
| subtraces | bigint |
| trace_address | string |
| error | string |
| status | bigint |
You can find column descriptions at https://github.com/medvedev1088/ethereum-etl-airflow
Note: for the address type, all hex characters are lower-cased.
The boolean type can have two values: True or False.
- In case the contract is a proxy, which forwards all calls to a delegate, interface detection doesn't work, which means `is_erc20` and `is_erc721` will always be false for proxy contracts and they will be missing in the `tokens` table.
- The metadata methods (`symbol`, `name`, `decimals`, `total_supply`) for ERC20 are optional, so around 10% of the contracts are missing this data. Also some contracts (EOS) implement these methods but with a wrong return type, so the metadata columns are missing in this case as well.
- `token_transfers.value`, `tokens.decimals` and `tokens.total_supply` have type `STRING` in BigQuery tables, because numeric types there can't handle 32-byte integers. You should use `cast(value as FLOAT64)` (possible loss of precision) or `safe_cast(value as NUMERIC)` (possible overflow) to convert them to numbers.
- The contracts that don't implement the `decimals()` function but have a fallback function that returns a `boolean` will have `0` or `1` in the `decimals` column in the CSVs.
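For example, a minimal query sketch for the cast above; the dataset name bigquery-public-data.crypto_ethereum is an assumption here (use whichever BigQuery dataset holds your tables):
> bq query --use_legacy_sql=false 'SELECT token_address, safe_cast(value AS NUMERIC) AS value FROM `bigquery-public-data.crypto_ethereum.token_transfers` LIMIT 10'  # dataset name is an assumption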
If you'd like to have the blockchain data platform set up and hosted for you in AWS or GCP, get in touch with us here.
- Install python 3.5.3+ https://www.python.org/downloads/
- You can use Infura if you don't need ERC20 transfers (Infura doesn't support the eth_getFilterLogs JSON RPC method). For that, use the `-p https://mainnet.infura.io` option for the commands below. If you need ERC20 transfers or want to export the data ~40 times faster, you will need to set up a local Ethereum node:
- Install geth https://github.com/ethereum/go-ethereum/wiki/Installing-Geth
- Start geth. Make sure it has downloaded the blocks that you need by executing `eth.syncing` in the JS console (see the console check after this list). You can export blocks below `currentBlock`; there is no need to wait for the full sync, as the state is not needed (unless you also need contract bytecode and token details; for those you do need to wait for the full sync).
- Install Ethereum ETL:
> pip3 install ethereum-etl
- Export all:
> ethereumetl export_all --help
> ethereumetl export_all -s 0 -e 5999999 -b 100000 -p file://$HOME/Library/Ethereum/geth.ipc -o output
In case the `ethereumetl` command is not available in PATH, use `python3 -m ethereumetl` instead.
The result will be in the `output` subdirectory, partitioned in Hive style:
output/blocks/start_block=00000000/end_block=00099999/blocks_00000000_00099999.csv
output/blocks/start_block=00100000/end_block=00199999/blocks_00100000_00199999.csv
...
output/transactions/start_block=00000000/end_block=00099999/transactions_00000000_00099999.csv
...
output/token_transfers/start_block=00000000/end_block=00099999/token_transfers_00000000_00099999.csv
...
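As a quick sanity check before exporting, you can query the sync status from the geth JS console non-interactively; a minimal sketch, assuming the default macOS IPC path used in the examples above:
> geth attach $HOME/Library/Ethereum/geth.ipc --exec 'eth.syncing'  # adjust the IPC path for your setup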
Should work with geth and parity, on Linux, Mac, and Windows.
If you use Parity, you should disable warp mode with the --no-warp option, because warp mode
does not place all of the block or receipt data into the database https://wiki.parity.io/Getting-Synced
If you see weird behavior, e.g. wrong number of rows in the CSV files or corrupted files, check out this issue: blockchain-etl#28
You can use AWS Auto Scaling and Data Pipeline to reduce the exporting time to a few hours. Read this article for details https://medium.com/@medvedev1088/how-to-export-the-entire-ethereum-blockchain-to-csv-in-2-hours-for-10-69fef511e9a2
- Install Docker https://docs.docker.com/install/
- Build a docker image
> docker build -t ethereum-etl:latest .
> docker image ls
- Run a container out of the image
> docker run -v $HOME/output:/ethereum-etl/output ethereum-etl:latest export_all -s 0 -e 5499999 -b 100000 -p https://mainnet.infura.io
> docker run -v $HOME/output:/ethereum-etl/output ethereum-etl:latest export_all -s 2018-01-01 -e 2018-01-01 -p https://mainnet.infura.io
- Run streaming to console or Pub/Sub
> docker build -t ethereum-etl:latest-streaming -f Dockerfile_with_streaming .
> echo "Stream to console"
> docker run ethereum-etl:latest-streaming stream --start-block 500000 --log-file log.txt
> echo "Stream to Pub/Sub"
> docker run -v /path_to_credentials_file/:/ethereum-etl/ --env GOOGLE_APPLICATION_CREDENTIALS=/ethereum-etl/credentials_file.json ethereum-etl:latest-streaming stream --start-block 500000 --output projects/<your-project>/topics/crypto_ethereum
- export_blocks_and_transactions
- export_token_transfers
- extract_token_transfers
- export_receipts_and_logs
- export_contracts
- export_tokens
- export_traces
- export_geth_traces
- extract_geth_traces
- get_block_range_for_date
- get_keccak_hash
- stream
All commands accept the -h parameter for help, e.g.:
> ethereumetl export_blocks_and_transactions -h
Usage: ethereumetl export_blocks_and_transactions [OPTIONS]
Export blocks and transactions.
Options:
-s, --start-block INTEGER Start block
-e, --end-block INTEGER End block [required]
-b, --batch-size INTEGER The number of blocks to export at a time.
-p, --provider-uri TEXT The URI of the web3 provider e.g.
file://$HOME/Library/Ethereum/geth.ipc or
https://mainnet.infura.io
-w, --max-workers INTEGER The maximum number of workers.
--blocks-output TEXT The output file for blocks. If not provided
blocks will not be exported. Use "-" for stdout
--transactions-output TEXT The output file for transactions. If not
provided transactions will not be exported. Use
"-" for stdout
-h, --help                  Show this message and exit.
For the --output parameters the supported types are csv and json. The format type is inferred from the output file name.
> ethereumetl export_blocks_and_transactions --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc \
--blocks-output blocks.csv --transactions-output transactions.csv
Omit --blocks-output or --transactions-output options if you want to export only transactions/blocks.
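Since the format is inferred from the file name, the same export can be written as JSON by pointing the outputs at .json files:
> ethereumetl export_blocks_and_transactions --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc \
--blocks-output blocks.json --transactions-output transactions.json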
You can tune --batch-size, --max-workers for performance.
Blocks and transactions schema.
The API used in this command is not supported by Infura, so you will need a local node. If you want to use Infura for exporting ERC20 transfers, refer to extract_token_transfers.
> ethereumetl export_token_transfers --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --batch-size 100 --output token_transfers.csv
Include --tokens <token1> --tokens <token2> to filter only certain tokens, e.g.
> ethereumetl export_token_transfers --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --output token_transfers.csv \
--tokens 0x86fa049857e0209aa7d9e616f7eb3b3b78ecfdb0 --tokens 0x06012c8cf97bead5deae237070f9587f8e7a266d
You can tune --batch-size, --max-workers for performance.
First extract transaction hashes from transactions.csv
(Exported with export_blocks_and_transactions):
> ethereumetl extract_csv_column --input transactions.csv --column hash --output transaction_hashes.txt
Then export receipts and logs:
> ethereumetl export_receipts_and_logs --transaction-hashes transaction_hashes.txt \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --receipts-output receipts.csv --logs-output logs.csv
Omit --receipts-output or --logs-output options if you want to export only logs/receipts.
You can tune --batch-size, --max-workers for performance.
Upvote this feature request openethereum/parity-ethereum#9075; it will make receipts and logs export much faster.
First export receipt logs with export_receipts_and_logs.
Then extract transfers from the logs.csv file:
> ethereumetl extract_token_transfers --logs logs.csv --output token_transfers.csv
You can tune --batch-size, --max-workers for performance.
First extract contract addresses from receipts.csv
(Exported with export_receipts_and_logs):
> ethereumetl extract_csv_column --input receipts.csv --column contract_address --output contract_addresses.txt
Then export contracts:
> ethereumetl export_contracts --contract-addresses contract_addresses.txt \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --output contracts.csv
You can tune --batch-size, --max-workers for performance.
First extract token addresses from contracts.json
(Exported with export_contracts):
> ethereumetl filter_items -i contracts.json -p "item['is_erc20'] or item['is_erc721']" | \
ethereumetl extract_field -f address -o token_addresses.txt
Then export ERC20 / ERC721 tokens:
> ethereumetl export_tokens --token-addresses token_addresses.txt \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --output tokens.csv
You can tune --max-workers for performance.
Also called internal transactions.
The API used in this command is not supported by Infura,
so you will need a local Parity archive node (parity --tracing on).
Make sure your node has at least 8GB of memory, or else you will face timeout errors.
See this issue
> ethereumetl export_traces --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/parity.ipc --batch-size 100 --output traces.csv
You can tune --batch-size, --max-workers for performance.
Read Differences between geth and parity traces.csv
The API used in this command is not supported by Infura,
so you will need a local Geth archive node (geth --gcmode archive --syncmode full --ipcapi debug).
When using rpc, add --rpc --rpcapi debug options.
> ethereumetl export_geth_traces --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --batch-size 100 --output geth_traces.json
You can tune --batch-size, --max-workers for performance.
> ethereumetl extract_geth_traces --input geth_traces.json --output traces.csv
You can tune --batch-size, --max-workers for performance.
> ethereumetl get_block_range_for_date --provider-uri=https://mainnet.infura.io --date 2018-01-01
4832686,4838611
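The returned block range can then be passed to any export command, e.g.:
> ethereumetl export_blocks_and_transactions --start-block 4832686 --end-block 4838611 \
--provider-uri https://mainnet.infura.io --blocks-output blocks.csv --transactions-output transactions.csv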
> ethereumetl get_keccak_hash -i "transfer(address,uint256)"
0xa9059cbb2ab09eb219583f4a59a5d0623ade346d962bcd4e46b11da047c9049b
> pip3 install ethereum-etl[streaming]
> ethereumetl stream --provider-uri https://mainnet.infura.io --start-block 500000
- This command outputs blocks, transactions, logs, token_transfers to the console by default.
- Entity types can be specified with the `-e` option, e.g. `-e block,transaction,log,token_transfer,trace,contract,token` (a combined example follows this list).
- Use the `--output` option to specify the Google Pub/Sub topic where to publish blockchain data, e.g. `projects/<your-project>/topics/bitcoin_blockchain`. Data will be pushed to the `projects/<your-project>/topics/bitcoin_blockchain.blocks`, `projects/<your-project>/topics/bitcoin_blockchain.transactions` etc. topics.
- The command saves its state to the `last_synced_block.txt` file, where the last synced block number is saved periodically.
- Specify either the `--start-block` or the `--last-synced-block-file` option. `--last-synced-block-file` should point to the file where the block number, from which to start streaming the blockchain data, is saved.
- Use the `--lag` option to specify how many blocks to lag behind the head of the blockchain. It's the simplest way to handle chain reorganizations - they are less likely the further a block is from the head.
- You can tune `--period-seconds`, `--batch-size`, `--block-batch-size`, `--max-workers` for performance.
- Refer to blockchain-etl-streaming for instructions on deploying it to Kubernetes.
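A sketch combining the streaming options above (the lag and period values are illustrative, and last_synced_block.txt is assumed to hold the block number from a previous run):
> ethereumetl stream --provider-uri https://mainnet.infura.io \
-e block,transaction,log,token_transfer --lag 10 \
--last-synced-block-file last_synced_block.txt --period-seconds 10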
Stream blockchain data continually to Google Pub/Sub:
> export GOOGLE_APPLICATION_CREDENTIALS=/path_to_credentials_file.json
> ethereumetl stream --start-block 500000 --output projects/<your-project>/topics/crypto_ethereum
Run the tests:
> pip3 install -e .[dev,streaming]
> export ETHEREUM_ETL_RUN_SLOW_TESTS=True
> pytest -vv
Run the tox tests:
> pip3 install tox
> tox
For getting ETC CSV files, make sure you pass in the --chain classic param where it's required for the scripts you want to export.
ETC export won't run if your --provider-uri is Infura. It will print a warning and change the provider-uri to https://ethereumclassic.network instead. For faster performance, run a local client for Classic instead, such as Parity (chain=classic) or Geth Classic.
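A hedged example of an ETC export, assuming the export_blocks_and_transactions command accepts the --chain option mentioned above:
> ethereumetl export_blocks_and_transactions --chain classic --start-block 0 --end-block 500000 \
--provider-uri https://ethereumclassic.network \
--blocks-output blocks.csv --transactions-output transactions.csv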
Differences between geth and parity traces.csv:
- `to_address` field differs for `callcode` trace (geth seems to return the correct value, as the parity value of `to_address` is the same as the `to_address` of the parent call);
- geth output doesn't have `reward` traces;
- geth output doesn't have `to_address`, `from_address`, `value` for `suicide` traces;
- `error` field contains a human readable error message, which might differ in geth/parity output;
- geth output doesn't have `transaction_hash`;
- `gas_used` is 0 on traces with error in geth, empty in parity;
- zero output of subcalls is `0x000...` in geth, `0x` in parity;
- Upload the files to S3:
> cd output
> aws s3 sync . s3://<your_bucket>/ethereumetl/export --region ap-southeast-1
- Sign in to Athena https://console.aws.amazon.com/athena/home
- Create a database:
CREATE DATABASE ethereumetl;
- Create the tables:
- blocks: schemas/aws/blocks.sql
- transactions: schemas/aws/transactions.sql
- token_transfers: schemas/aws/token_transfers.sql
- contracts: schemas/aws/contracts.sql
- receipts: schemas/aws/receipts.sql
- logs: schemas/aws/logs.sql
- tokens: schemas/aws/tokens.sql
Refer to https://github.com/medvedev1088/ethereum-etl-airflow for the instructions.
Read this article on how to convert CSVs to Parquet https://medium.com/@medvedev1088/converting-ethereum-etl-files-to-parquet-399e048ddd30
- Create the tables:
- parquet_blocks: schemas/aws/parquet/parquet_blocks.sql
- parquet_transactions: schemas/aws/parquet/parquet_transactions.sql
- parquet_token_transfers: schemas/aws/parquet/parquet_token_transfers.sql
Note that the DECIMAL type is limited to 38 digits in Hive https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-decimal so values with more than 38 digits will be null.
You can query the data that's updated daily in the public BigQuery dataset https://medium.com/@medvedev1088/ethereum-blockchain-on-google-bigquery-283fb300f579
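A minimal query sketch against the public dataset; the dataset name bigquery-public-data.crypto_ethereum is an assumption (check the article above for the current name):
> bq query --use_legacy_sql=false 'SELECT DATE(block_timestamp) AS day, count(*) AS tx_count FROM `bigquery-public-data.crypto_ethereum.transactions` GROUP BY day ORDER BY day DESC LIMIT 7'  # dataset name is an assumption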
https://github.com/blockchain-etl/awesome-bigquery-views
Read this article https://medium.com/google-cloud/how-to-query-balances-for-all-ethereum-addresses-in-bigquery-fb594e4034a7
Read this article https://medium.com/google-cloud/building-token-recommender-in-google-cloud-platform-1be5a54698eb
You can access the Ethereum dataset in Kaggle https://www.kaggle.com/bigquery/ethereum-blockchain.
- A Technical Breakdown Of Google's New Blockchain Search Tools: https://www.forbes.com/sites/michaeldelcastillo/2019/02/05/google-launches-search-for-bitcoin-ethereum-bitcoin-cash-dash-dogecoin-ethereum-classic-litecoin-and-zcash/#394fc868c789
- Navigating Bitcoin, Ethereum, XRP: How Google Is Quietly Making Blockchains Searchable: https://www.forbes.com/sites/michaeldelcastillo/2019/02/04/navigating-bitcoin-ethereum-xrp-how-google-is-quietly-making-blockchains-searchable/?ss=crypto-blockchain#49e111da4248
- Querying the Ethereum Blockchain in Snowflake: https://community.snowflake.com/s/article/Querying-the-Ethereum-Blockchain-in-Snowflake
