https://github.com/facebookarchive/linkbench

Facebook Graph Benchmark
https://github.com/facebookarchive/linkbench
Last synced: about 2 months ago
JSON representation
Facebook Graph Benchmark
Host: GitHub
URL: https://github.com/facebookarchive/linkbench
Owner: facebookarchive
License: apache-2.0
Archived: true
Created: 2012-05-22T06:58:52.000Z (almost 13 years ago)
Default Branch: master
Last Pushed: 2020-02-11T05:12:23.000Z (about 5 years ago)
Last Synced: 2024-04-08T00:14:53.664Z (about 1 year ago)
Language: Java
Size: 15.2 MB
Stars: 415
Watchers: 65
Forks: 142
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

awesome-java - LinkBench
README

        - - -

**_This project is not actively maintained. Proceed at your own risk!_**

- - -  

LinkBench Overview

====================

LinkBench is a database benchmark developed to evaluate database

performance for workloads similar to those of Facebook's production MySQL deployment.

LinkBench is highly configurable and extensible.  It can be reconfigured

to simulate a variety of workloads and plugins can be written for

benchmarking additional database systems.

LinkBench is released under the Apache License, Version 2.0.

Background

----------

One way of modeling social network data is as a *social graph*, where

entities or *nodes* such as people, posts, comments and pages are connected

by *links* which model different relationships between the nodes.  Different

types of links can represent friendship between two users, a user liking another object,

ownership of a post, or any relationship you like.  These nodes and links carry metadata such as

their type, timestamps and version numbers, along with arbitrary *payload data*.

Facebook represents much of its data in this way, with the data stored in MySQL

databases.  The goal of LinkBench is to emulate the social graph database workload

and provide a realistic benchmark for database performance on social workloads.

LinkBench's data model is based on the social graph, and LinkBench has the ability to generate

a large synthetic social graph with key properties similar to the real graph.

The workload of database operations is based on Facebook's production workload, and is also

generated in such a way that key properties of the workload match the production workload.

LinkBench Architecture

----------------------


++====================================++

||          LinkBench Driver          ||

++====================================++

||   +---------------------------+    ||

||   | Graph      | Workload     |    ||      Open connections  +=======+

||   | Generator  | Generator    |    ||   /------------------> | Graph |

||   +---------------------------+    ||  /-------------------> | Store |

||   |                           |<======---------------------> | Shard |

||   |   Graph Store Adapter     |<======---------------------> |       |

||   |   (e.g. MySQL adapter)    |<======---------------------> | e.g.  |

||   +---------------------------+    ||  \-------------------> | MySQL |

||                                    ||   \------------------> | Server|

||   ~~~~~~~~~~~~    ~~~~~~~~~~~~     ||                        +=======+

||   ~~~~~~~~~~~~    ~~~~~~~~~~~~     ||

||   ~~~~~~~~~~~~    ~~~~~~~~~~~~     ||

||   ~~~~~~~~~~~~    ~~~~~~~~~~~~     ||

||     Requester Threads              ||

++====================================++



The main software component of LinkBench is the driver, which acts as the

client to the database being benchmarked.

LinkBench is designed to support benchmarking of any database system that can support

all of the require graph operations through a *Graph Store Adapter*.

The LinkBench benchmark typically proceeds in two phases.

The first is the *load phase*,

where an initial graph is generated using the

*graph generator* and loaded into the graph store in bulk.

On a large benchmark run, this graph might have a billion nodes, and occupy over a terabyte

on disk.  The generated graph is designed to have similar properties to the Facebook

social graph.  For example, the number of links out from each node follows a power-law

distribution, where most nodes have at most a few links, but a few nodes have many

more links.

The second is the *request phase*, where the actual benchmarking occurs.  In

the request phase, the benchmark driver spawns many request threads, which make

concurrent requests to the database.  The *workload generator* is used by each

request thread to generate a series of database operations that mimics the

Facebook production workload in many aspects.  For example, the mix of

different varieties of read and write operations is the same, and the

access patterns create a similar pattern of hot (frequently access)

and cold nodes in the graph.  At the end of the request phase LinkBench

will report a range of statistics such as latency and throughput.

Getting Started

===============

In this README we'll walk you through compiling LinkBench and running

a MySQL benchmark.

Prerequisites:

--------------

These instructions assume you are using a UNIX-like system such as a Linux distribution

or Mac OS X.

**Java**: You will need a Java 7+ runtime environment.  LinkBench by default

      uses the version of Java on your path.  You can override this by setting the

      JAVA\_HOME environment variable to the directory of the desired

      Java runtime version.  You will also need a Java JDK to compile from source.

**Maven**: To build LinkBench, you will need the Apache Maven build tool. If

    you do not have it already, it is available from http://maven.apache.org .

**MySQL Connector**:  To benchmark MySQL with LinkBench, you need MySQL

    Connector/J, A version of the MySQL connector is bundled with

    LinkBench.  If you wish to use a more recent version, replace the

    mysql jar under lib/.  See http://dev.mysql.com/downloads/connector/j/

**MySQL Server**: To benchmark MySQL you will need a running MySQL

    server with free disk space.

Getting and Building LinkBench

----------------------------

First get the source code

    git clone [email protected]:facebook/linkbench.git

Then enter the directory and build LinkBench

    cd linkbench

    mvn clean package

In order to skip slower tests (some run quite long), type

    mvn clean package -P fast-test

To skip all tests

    mvn clean package -DskipTests

If the build is successful, you should get a message like this at the end of the output:

    BUILD SUCCESSFUL

    Total time: 3 seconds

If the build fails while downloading required files, you may need to configure Maven,

for example to use a proxy.  Example Maven proxy configuration is shown here:

http://maven.apache.org/guides/mini/guide-proxies.html

Now you can run the LinkBench command line tool:

    ./bin/linkbench

Running it without arguments will show a brief help message:

    Did not select benchmark mode

    usage: linkbench [-c ] [-csvstats ] [-csvstream ] [-D

           ] [-L ] [-l] [-r]

     -c                        Linkbench config file

     -csvstats,--csvstats      CSV stats output

     -csvstream,--csvstream    CSV streaming stats output

     -D              Override a config setting

     -L                        Log to this file

     -l                              Execute loading stage of benchmark

     -r                              Execute request stage of benchmark

Running a Benchmark with MySQL

==============================

In this section we will document the process of setting

up a new MySQL database and running a benchmark with LinkBench.

MySQL Setup

-----------

We need to create a new database and tables on the MySQL server.

We'll create a new database called `linkdb` and

the needed tables to store graph nodes, links and link counts.

Run the following commands in the MySQL console:

    create database linkdb;

    use linkdb;

    CREATE TABLE `linktable` (

      `id1` bigint(20) unsigned NOT NULL DEFAULT '0',

      `id2` bigint(20) unsigned NOT NULL DEFAULT '0',

      `link_type` bigint(20) unsigned NOT NULL DEFAULT '0',

      `visibility` tinyint(3) NOT NULL DEFAULT '0',

      `data` varchar(255) NOT NULL DEFAULT '',

      `time` bigint(20) unsigned NOT NULL DEFAULT '0',

      `version` int(11) unsigned NOT NULL DEFAULT '0',

      PRIMARY KEY (link_type, `id1`,`id2`),

      KEY `id1_type` (`id1`,`link_type`,`visibility`,`time`,`id2`,`version`,`data`)

    ) ENGINE=InnoDB DEFAULT CHARSET=latin1 PARTITION BY key(id1) PARTITIONS 16;

    CREATE TABLE `counttable` (

      `id` bigint(20) unsigned NOT NULL DEFAULT '0',

      `link_type` bigint(20) unsigned NOT NULL DEFAULT '0',

      `count` int(10) unsigned NOT NULL DEFAULT '0',

      `time` bigint(20) unsigned NOT NULL DEFAULT '0',

      `version` bigint(20) unsigned NOT NULL DEFAULT '0',

      PRIMARY KEY (`id`,`link_type`)

    ) ENGINE=InnoDB DEFAULT CHARSET=latin1;

    CREATE TABLE `nodetable` (

      `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,

      `type` int(10) unsigned NOT NULL,

      `version` bigint(20) unsigned NOT NULL,

      `time` int(10) unsigned NOT NULL,

      `data` mediumtext NOT NULL,

      PRIMARY KEY(`id`)

    ) ENGINE=InnoDB DEFAULT CHARSET=latin1;

You may want to set up a special database user account for benchmarking:

    -- Note: replace 'linkbench'@'localhost' with 'linkbench'@'%' to allow remote connections

    CREATE USER 'linkbench'@'localhost' IDENTIFIED BY 'mypassword';

    -- Grant all privileges on linkdb to this user

    GRANT ALL ON linkdb TO 'linkbench'@'localhost'

If you want to obtain representative benchmark results, we highly

recommend that you invest some time configuring and tuning MySQL.

MySQL performance tuning can be complex and a comprehensive guide

is beyond the scope of this readme, but here are a few basic guidelines:

* Read the [Optimization section of the MySQL user manual](http://dev.mysql.com/doc/refman/5.6/en/optimization.html).

* Make sure you have a sensible size setting for the [InnoDB buffer pool size](http://dev.mysql.com/doc/refman/5.6/en/optimizing-innodb-diskio.html),

  so as to reduce disk I/O.

* Table partitioning (as shown above) can eliminate some bottlenecks

  that occur with LinkBench where the linktable is heavily accessed.

Configuration Files

-------------------

LinkBench requires several configuration files that specify the

benchmark setup, the parameters of the graph to be generated, etc.

Before benchmarking you will want to make a copy of the example config file:

    cp config/LinkConfigMysql.properties config/MyConfig.properties

Open MyConfig.properties.  At a minimum you will need to fill in the

settings under *MySQL Connection Information* to match the server, user

and database you set up earlier. E.g.

    # MySQL connection information

    host = localhost

    user = linkbench

    password = your_password

    port = 3306

    dbid = linkdb

You can read through the settings in this file.  There are a lot of settings

that control the benchmark itself, and the output of the LinkBench

command link tool.  Notice that MyConfig.properties

references another file in this line:

    workload_file = config/FBWorkload.properties

This workload file defines how the social

graph should be generated and what mix of operations should make

up the benchmark.  The included workload file has been tuned to

match our production workload in query mix.  If you want to change

the scale of the benchmark (the default graph is quite small for

benchmarking purposes), you should look at the maxid1 setting.  This

controls the number of nodes in the initial graph created in the load

phase: increase it to get a larger database.

      # start node id (inclusive)

      startid1 = 1

      # end node id for initial load (exclusive)

      # With default config and MySQL/InnoDB, 1M ids ~= 1GB

      maxid1 = 10000001

Loading Data

------------

First we need to do an initial load of data using our new config file:

    ./bin/linkbench -c config/MyConfig.properties -l

This will take a while to load, and you should get frequent progress updates.

Once loading is finished you should see a notification like:

    LOAD PHASE COMPLETED.  Loaded 10000000 nodes (Expected 10000000).

      Loaded 47423071 links (4.74 links per node).  Took 620.4 seconds.

      Links/second = 76435

At the end LinkBench reports a range of statistics on load time that are

of limited interest at this stage.

You can significantly speed up the LinkBench load phase by making these

temporary changes in the MySQL command shell before loading:

    alter table linktable drop key `id1_type`;

    set global innodb_flush_log_at_trx_commit = 2;

    set global sync_binlog = 0;

After loading you should revert the changes:

    set global innodb_flush_log_at_trx_commit = 1;

    set global sync_binlog = 1;

    alter table linktable add key `id1_type`

      (`id1`,`link_type`,`visibility`,`time`,`id2`,`version`,`data`);

Request Phase

-------------

Now you can do some benchmarking.

Run the request phase using the below command:

    ./bin/linkbench -c config/MyConfig.properties -r

LinkBench will log progress to the console, along with statistics.

Once all requests have been sent, or the time limit has elapsed, LinkBench

will notify you of completion:

    REQUEST PHASE COMPLETED. 25000000 requests done in 2266 seconds.

      Requests/second = 11029

You can also inspect the latency statistics. For example, the following line tells us the mean latency

for link range scan operations, along with latency ranges for median (p50), 99th percentile (p99) and

so on.

    GET_LINKS_LIST count = 12678653  p25 = [0.7,0.8]ms  p50 = [1,2]ms

                   p75 = [1,2]ms  p95 = [10,11]ms  p99 = [15,16]ms

                   max = 2064.476ms  mean = 2.427ms

Advanced LinkBench Command Line Usage

-------------------------------------

Here are some further examples of how to use the LinkBench command link utility.

You can override any properties from the configuration file from the

command line with -D key=value.  For example, this runs the benchmark

with a 10 minute warmup before collecting statistics:

    ./bin/linkbench -c config/MyConfig.properties -D warmup_time=600 -r

This runs the benchmark with more detailed logging, and all output going to the file linkbench.log:

    ./bin/linkbench -c config/MyConfig.properties -D debuglevel=DEBUG -L linkbench.log -r

LinkBench supports output of statistics in csv format for easier analysis.

There are two categories of statistic: the final summary and per-thread statistics

output periodically through the benchmark.  -csvstats controls the former and -csvstream the latter:

    ./bin/linkbench -c config/MyConfig.properties -csvstats final-stats.csv -csvstreams streaming-stats.csv -r

Benchmark Guidelines

====================

Benchmarks are often controversial and are challenging to do well.

Here are some guidelines for avoiding common pitfalls with LinkBench.

Database Tuning

---------------

To remove confounding factors in database setup, there are several steps you can take

to obtain better results:

* Warm up the databases before collecting statistics.  LinkBench has a

  *warmup_time* setting that sends requests for a period before starting to

  collect statistics.

* Run benchmarks for long periods of time (hours rather than minutes)

    to reduce impact of random variation and to allow the database to

    reach a steady state.

* If at all possible, get expert help tuning the database for your

  hardware and workload.

* Benchmarks where the database fits mostly or entirely in RAM are interesting

  but aren't comparable to benchmarks where

  the database is much larger than RAM.  Typically for MySQL benchmarks our databases

  are 10-15x larger than the buffer pool.

* Databases should be benchmarked in comparable configurations.  We

  always run LinkBench with durable writes (i.e. so that

  after an operation returns, the data is written to persistent storage and can be

  recovered in the event of a system crash).

  Similarly, our LinkBench MySQL implementation provides serializable consistency of

  operations.  Weaker durability or consistency properties should be

  disclosed alongside benchmark results.

Understanding Performance Profile Under Varying Load

----------------------------------------------------

Different systems can behave different when heavily or lightly loaded.

The default benchmark settings simulate a heavily loaded database,

with 100 concurrent request threads each sending requests as quickly as they can.

Some database systems perform better than others with many concurrent

clients or heavy load, so performance under heavy load does not

give a complete picture of performance.

Typically databases are not fully loaded all of the time,

so latency of requests under moderate load is also an important

measure of database performance.

To get a better understanding of database performance under varying load it

can be helpful to:

* Modify the *requesters* parameter to test database performance with varying

  numbers of clients.

* Modify the *requestrate* config setting so that requests are throttled.

  Request latency vs. throughput curves help with understanding the full

  performance profile of a database system.

Understanding Resource Utilization

-------------------------

If you are doing a benchmark exercise, it is often a good idea to collect

additional information about system resource utilization, particularly

for CPU and I/O.  This can aid a lot in understanding and comparing

benchmark results beyond headline performance numbers.  It is

easiest to make use of collected data if you can match up timestamps

to your benchmark logs, so the examples here will append

timestamps to each line of output.

vmstat reports useful summary information on CPU and memory:

    vmstat 1 | gawk '{now=strftime("%Y-%m-%d %T "); print now $0}' > linkbench.run.1/vmstat.out

iostat reports some useful I/O statistics:

    iostat -d -x 1 |  gawk '{now=strftime("%Y-%m-%d %T "); print now $0}' > linkbench.run.1/iostat.out

Extending/Customizing LinkBench

===============================

You can customize LinkBench in several ways.

Reconfiguring Workload

---------------------

We have already introduced you to the LinkBench configuration files.

All settings in these files are documented and a great deal can be changed

simply through these configuration files.  For example:

* You can experiment with read-intensive or write-intensive workloads by

  modifying the mix of operations.

* You can alter the mix of hot/cold rows by modifying the shape

  parameter for ZipfDistribution.  If you set it close to 1, there will be only

  a few very hot nodes in the database, or if you set if close to 0, accesses

  will be spread evenly across all nodes.

Additional Workload Generators

------------------------------

It is possible to further customize the data and workload by providing

new implementations of some key classes:

* ProbabilityDistribution: which can be used to control the distribution of

    out-edges in the graph, or the access patterns for requests.

* DataGenerator: which can be used to generate data in different ways for requests.

Additional Database Systems

---------------------------

You can write plugins to benchmark additional database systems

simply by writing a Java class implementing a small set of graph operations.

Any classes implementing the `com.facebook.LinkBench.LinkStore`

and `com.facebook.LinkBench.NodeStore` interfaces can be loaded

through the *linkstore* and *nodestore* configuration file keys.

There are several steps you will have to go through to add a

new plugin .

First you need to choose you will represent LinkBench

nodes and links.  Several factors play a role in the design, but

speed of range scans and atomicity of updates are particularly important.

The MySQL schema from earlier in this README serves as a reference

implementation.

Next you need to create a new Java class, such as `public class MyStore

extends GraphStore`, and implement all of the required methods of

`LinkStore` and `NodeStore`.  Two reference implementations are provided:

`LinkStoreMysql`, a fully-fledged implementation,  and `MemoryLinkStore`,

a toy in-memory implementation.

LinkBench provides some tests to validate your implementation that you

can use during development.  If you extend any of the test classes

`LinkStoreTestBase`, `NodeStoreTestBase` and `GraphStoreTestBase` with

the required methods that set up your database, then a range of tests

will be run against it.  These tests are sanity checks rather than

comprehensive verification of your implementation.  In particular,

they do not try to verify the atomicity, consistency or durability

properties of the implementation.

Database-specific tests are not run by default.  You can enable them

with Maven profiles.  For example, to run the MySQL tests you can

run:

    mvn test -P mysql-test

The MySQL related unit tests are run against a test database that needs

setting up before running the unit tests. The default settings for this

test database are hardcoded in src/test/java/com/facebook/LinkBench/MySqlTestConfig.java.

The default settings uses localhost:3306 to connect to the database and

uses username "linkbench" and password "linkbench".  The unit test code 

creates all the required tables, so the developer needs to setup a 

MySql database called "linkbench_unittestdb" to which the linkbench user 

has permissions to create and drop tables.

**If you implement a plugin for a new database, please consider contributing

it back to the main LinkBench distribution with a pull request.**
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/facebookarchive/linkbench

Awesome Lists containing this project

README