https://github.com/chetan/sewer
A high performance, reliable pixel server
- Host: GitHub
- URL: https://github.com/chetan/sewer
- Owner: chetan
- License: apache-2.0
- Created: 2011-12-31T04:35:08.000Z (almost 13 years ago)
- Default Branch: master
- Last Pushed: 2013-09-04T20:13:44.000Z (over 11 years ago)
- Last Synced: 2024-10-08T23:22:24.266Z (3 months ago)
- Language: Java
- Homepage: http://chetan.github.com/sewer/
- Size: 215 KB
- Stars: 7
- Watchers: 3
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Sewer - a high performance, reliable pixel server
Sewer is built for a single purpose: serving "204 No Content" responses via an embedded [Jetty](http://www.eclipse.org/jetty/) server and writing access logs to HDFS as quickly and reliably as possible.
Sewer was heavily inspired by [Apache Flume](https://cwiki.apache.org/FLUME/).
## Getting Started
1. Build or download the latest version:

        $ git clone https://github.com/chetan/sewer.git
        $ buildr test=no clean package
        $ cp target/sewer-*.tgz /opt

2. Unpack tarball:

        $ tar -xzf sewer-*.tgz

3. Configure Sink:

        $ vim conf/config.properties

4. Start

        $ bin/sewer.sh start

That's it! Sewer should now be up and running.

    # test the pixel server
    $ curl -v localhost:8080
    < HTTP/1.1 204 No Content

    # status is available on 8081 (e.g., for load balancers that require a 200)
    $ curl -v localhost:8081
    < HTTP/1.1 200 OK
    < Content-Length: 0

    # jmx agent is on 7777
    $ jmx4perl http://localhost:7777/jolokia read org.eclipse.jetty.server.handler:id=0,type=statisticshandler requests
    1234

See [Jolokia](http://www.jolokia.org/) and [Jmx4Perl](https://metacpan.org/module/JMX::Jmx4Perl) for more about using the built-in JMX agent for monitoring and statistics gathering.
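
The two curl checks above can also be scripted. The following is a minimal sketch in Java 11+, not part of Sewer itself: `PixelProbe`, `status`, `healthy` and `startStub` are hypothetical names, and the stub server only stands in for a running Sewer instance so the sketch can be tried without one.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not shipped with Sewer.
class PixelProbe {
    /** Issues a GET and returns the HTTP status code. */
    static int status(String url) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            return HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.discarding())
                    .statusCode();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** True when the pixel port answers 204 and the status port answers 200. */
    static boolean healthy(String pixelUrl, String statusUrl) {
        return status(pixelUrl) == 204 && status(statusUrl) == 200;
    }

    /** Local stand-in that answers every request with the given status and no body. */
    static HttpServer startStub(int statusCode) {
        try {
            // Port 0 asks the OS for any free port.
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                exchange.sendResponseHeaders(statusCode, -1); // -1 = no response body
                exchange.close();
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Against a real node, assuming the default ports shown above, the call would be `PixelProbe.healthy("http://localhost:8080/", "http://localhost:8081/")`.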
## Reliability
Sewer is designed to be extremely reliable for a number of different failure scenarios with minimal impact on performance.
It is designed to write directly to HDFS from the node which generates the event. As such, it is capable of surviving a *single downstream failure* and automatically retrying when the downstream issue has been resolved.
Types of errors that will be recovered from include:
* Network errors
* NameNode unreachable
* NameNode in safe-mode
* DataNode errors:
    * HDFS create/close fails
    * etc.

### How it works
Events are written in batches, which are rotated on a timer (every 30 seconds by default). When an event is received, it is first written to a local disk buffer before a write to HDFS is attempted. If a batch is successfully flushed and closed, its local buffer is deleted. On failure, the buffer remains and moves into a retry queue, where it is retried asynchronously until the downstream error is resolved and the batch closes cleanly.
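
The buffer-then-retry flow described above could be sketched roughly as follows. This is an illustration under stated assumptions, not Sewer's actual code: `BatchSink` and `BufferedBatchWriter` are hypothetical names, the sink stands in for an HDFS writer, and the sketch ships a whole batch at rotation time rather than streaming events downstream as they arrive. It is also single-threaded, whereas the real retry runs asynchronously.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical downstream target; stands in for an HDFS writer. */
interface BatchSink {
    void write(byte[] batch) throws IOException;
}

class BufferedBatchWriter {
    private final Path dir;
    private final BatchSink sink;
    private final Queue<Path> retryQueue = new ArrayDeque<>();
    private Path current;
    private int seq = 0;

    BufferedBatchWriter(Path dir, BatchSink sink) {
        this.dir = dir;
        this.sink = sink;
        this.current = nextBuffer();
    }

    private Path nextBuffer() {
        return dir.resolve("batch-" + (seq++) + ".buf");
    }

    /** Every event lands in the local buffer file first. */
    void append(String event) {
        try {
            Files.writeString(current, event + "\n",
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Called by the rotation timer: ship the finished batch or queue it for retry. */
    void rotate() {
        Path finished = current;
        current = nextBuffer();
        if (Files.exists(finished) && !ship(finished)) {
            retryQueue.add(finished);
        }
    }

    /** Re-attempts each queued batch once; returns how many are still pending. */
    int retryPending() {
        int attempts = retryQueue.size();
        for (int i = 0; i < attempts; i++) {
            Path batch = retryQueue.poll();
            if (!ship(batch)) {
                retryQueue.add(batch);
            }
        }
        return retryQueue.size();
    }

    private boolean ship(Path buffer) {
        try {
            sink.write(Files.readAllBytes(buffer));
            Files.delete(buffer); // success: the local copy is no longer needed
            return true;
        } catch (IOException e) {
            return false; // failure: the buffer stays on disk for a later retry
        }
    }
}
```

Keeping the buffer on disk until the downstream write succeeds is what lets a batch survive an outage: nothing is deleted until the sink has accepted it.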
### Stopping
When Sewer is stopped or receives a kill signal, it tries to shut down cleanly. First the source is closed so that no more events are received. Then it tries to cleanly close the current event batch. If there is a downstream failure, any open batches are drained automatically when Sewer is started again.
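
The ordering described above (close the source first, then the batch) can be illustrated with a JVM shutdown hook. This is a sketch only: `GracefulShutdown` is a hypothetical class, not Sewer's shutdown code, and the steps passed to it are placeholders for whatever closes the source and the current batch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating ordered shutdown; not taken from Sewer.
class GracefulShutdown {
    private final List<Runnable> steps = new ArrayList<>();

    /** Registers a step; steps run in registration order. */
    GracefulShutdown then(Runnable step) {
        steps.add(step);
        return this;
    }

    /** Runs every step even if an earlier one throws; returns the number of failed steps. */
    int run() {
        int failures = 0;
        for (Runnable step : steps) {
            try {
                step.run();
            } catch (RuntimeException e) {
                failures++; // e.g. the batch could not be closed because HDFS is down
            }
        }
        return failures;
    }

    /** Hooks the sequence into JVM shutdown (SIGTERM, Ctrl-C). A kill -9 bypasses it. */
    void installHook() {
        Runtime.getRuntime().addShutdownHook(new Thread(this::run, "shutdown-sequence"));
    }
}
```

A step that fails here leaves its work undone, which matches the behaviour described above: a batch that could not be closed stays on disk and is drained on the next start.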
### Performance Tradeoffs
For maximum I/O performance, in-memory buffers are used in several locations. Thus, if the server were to suffer a hard crash (or a kill -9), it is possible that some events will be lost. This is considered an acceptable tradeoff: zero event loss cannot be guaranteed in such a case anyway, since at a minimum some number of active HTTP requests would not complete. Those dropped requests would typically outnumber the events lost to internal buffering in any case.
## Log Format
Sewer is built on Hadoop's *Writable* data format. Access log events look like the following:

    long timestamp;
    String ip;
    String host;
    String requestPath;
    String queryString;
    String referer;
    String userAgent;
    String cookies;

It can be easily extended to write additional headers or handle other types of requests such as POST.
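
To show what a record with these fields might look like in code, here is a minimal sketch. `AccessLogEvent` is a hypothetical class name. Its `write` and `readFields` methods have the same signatures as Hadoop's `Writable` interface, but the class deliberately does not implement it so that the example runs without Hadoop on the classpath. Encoding strings with `writeUTF` is an assumption as well; Sewer's actual on-disk encoding may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record mirroring the field list above.
class AccessLogEvent {
    long timestamp;
    String ip = "", host = "", requestPath = "", queryString = "",
           referer = "", userAgent = "", cookies = "";

    /** Same signature as Writable.write(DataOutput). */
    void write(DataOutput out) throws IOException {
        out.writeLong(timestamp);
        // writeUTF is limited to 65535 encoded bytes per string.
        out.writeUTF(ip);
        out.writeUTF(host);
        out.writeUTF(requestPath);
        out.writeUTF(queryString);
        out.writeUTF(referer);
        out.writeUTF(userAgent);
        out.writeUTF(cookies);
    }

    /** Same signature as Writable.readFields(DataInput); reads fields in write order. */
    void readFields(DataInput in) throws IOException {
        timestamp = in.readLong();
        ip = in.readUTF();
        host = in.readUTF();
        requestPath = in.readUTF();
        queryString = in.readUTF();
        referer = in.readUTF();
        userAgent = in.readUTF();
        cookies = in.readUTF();
    }

    /** Convenience wrapper for trying the round trip in memory. */
    byte[] toBytes() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static AccessLogEvent fromBytes(byte[] bytes) {
        try {
            AccessLogEvent event = new AccessLogEvent();
            event.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return event;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Adding another header in this style means adding one field plus one matching line in each of `write` and `readFields`, in the same position in both.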
## Benchmarks
### EC2
* m1.small: 3,622 reqs/sec
* m1.large: 13,293 reqs/sec
* c1.medium: 16,556 reqs/sec
* m1.small via elb: 3,205 reqs/sec
* m1.large via elb:

Methodology: 2x m1.large load generators running 'ab' twice each with the following params:

    ab #{LONG_UA} -k -r -t 600 -n 500000 -c 400 #{URL}
    LONG_UA = 800 byte user agent header to simulate a large payload

Tests run January, 2012
## Not Quite a Flume Replacement
While Sewer uses the same source/sink pattern under the hood, it is not designed to be a drop-in Flume replacement. There is currently no master/server implementation for centrally controlling Sewer nodes, nor is there support for multiple flows or on-the-fly reconfiguration of nodes. Reconfiguration requires modifying the config file and bouncing the Sewer process.
That said, while Sewer was built for pixel serving, it should be relatively trivial to add more sources and sinks and build some of this extra functionality if so desired. In fact, there is already a basic IPC implementation modeled after Hadoop that is currently unused.
## License
Copyright 2012 Pixelcop Research, Inc.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

[http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.