https://github.com/composewell/haskell-perf
- Host: GitHub
- URL: https://github.com/composewell/haskell-perf
- Owner: composewell
- License: apache-2.0
- Created: 2023-06-08T22:55:59.000Z (about 3 years ago)
- Default Branch: master
- Last Pushed: 2024-02-05T21:06:09.000Z (over 2 years ago)
- Last Synced: 2025-03-11T17:25:25.050Z (over 1 year ago)
- Language: Haskell
- Size: 121 KB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 8
Metadata Files:
- Readme: README.md
- Changelog: Changelog.md
- License: LICENSE
# haskell-perf
GHC Patch: https://github.com/composewell/ghc/tree/ghc-8.10.7-eventlog-enhancements
## Enable Linux perf counters
Enable unrestricted use of perf counters (run as root):
```
# echo -1 > /proc/sys/kernel/perf_event_paranoid
```
## Disable CPU scaling
Set the scaling governor of all your CPUs to `performance`:
```
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
...
...
echo performance > /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor
```
## Generating the eventlog
To generate the eventlog, we need to compile the program with the eventlog enabled
and run it with the `-l` RTS option.
There are multiple ways of doing this.
__Using plain GHC__:
```
ghc Main.hs -rtsopts -eventlog
./Main +RTS -l -RTS
```
__Using Cabal__:
The `.cabal` file should contain the following GHC options:
```
ghc-options: -eventlog "-with-rtsopts=-l"
```
If the `-threaded` option is used while compiling, you may want to use the `-N1`
RTS option.
## Creating windows
Helper function to create windows:
```
{-# LANGUAGE BangPatterns #-}

import Control.Monad.IO.Class (MonadIO(..))
import Debug.Trace (traceEventIO)

{-# INLINE withTracingFlow #-}
withTracingFlow :: MonadIO m => String -> m a -> m a
withTracingFlow tag action = do
    liftIO $ traceEventIO ("START:" ++ tag)
    !res <- action
    liftIO $ traceEventIO ("END:" ++ tag)
    pure res
```
We can wrap parts of the flow we want to analyze with `withTracingFlow` using a
tag to help us identify it.
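A minimal usage sketch, assuming the helper above is in scope. `expensiveStep` and the tag `"expensive"` are hypothetical names used only for illustration:

```haskell
{-# LANGUAGE BangPatterns #-}

import Control.Monad.IO.Class (MonadIO(..))
import Debug.Trace (traceEventIO)

-- Helper from the section above, repeated so the example is self-contained.
{-# INLINE withTracingFlow #-}
withTracingFlow :: MonadIO m => String -> m a -> m a
withTracingFlow tag action = do
    liftIO $ traceEventIO ("START:" ++ tag)
    !res <- action
    liftIO $ traceEventIO ("END:" ++ tag)
    pure res

-- Hypothetical workload; it stands in for the code you want to measure.
-- `$!` forces the sum before the action returns.
expensiveStep :: Int -> IO Int
expensiveStep n = pure $! sum [1 .. n]

main :: IO ()
main = do
    -- Only the work done inside the wrapped action belongs to the
    -- "expensive" window.
    total <- withTracingFlow "expensive" (expensiveStep 100000)
    print total
```

The bang pattern in the helper only forces the result to weak head normal form, so the wrapped action should itself do the work that is meant to be measured.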
## End of Window
You can put the END of the window in different paths but ensure that all paths
are covered:
```
r <- f x
case r of
    Just val -> do
        -- _ <- L.runIO $ traceEventIO $ "END:" ++ "window"
        -- Some processing
    Nothing -> do
        -- _ <- L.runIO $ traceEventIO $ "END:" ++ "window"
        -- Some processing
```
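When the window should cover a whole action, one way to make sure no path misses the END is to emit it from a single place. The following is a minimal sketch under that assumption, for actions in plain `IO`. `withTracingFlowSafe` and `classify` are hypothetical names, not part of this repository; `finally` and `evaluate` come from `Control.Exception` in `base`.

```haskell
import Control.Exception (evaluate, finally)
import Debug.Trace (traceEventIO)

-- Hypothetical variant of the window helper (not part of this repository).
withTracingFlowSafe :: String -> IO a -> IO a
withTracingFlowSafe tag action = do
    traceEventIO ("START:" ++ tag)
    -- `evaluate` forces the result to WHNF inside the window; `finally`
    -- emits END on normal return and also when the action throws.
    (action >>= evaluate) `finally` traceEventIO ("END:" ++ tag)

-- Hypothetical function with two result paths; both are covered by the
-- single END emitted by the wrapper.
classify :: Int -> IO String
classify x = withTracingFlowSafe "window" $
    case lookup x table of
        Just val -> pure ("found " ++ val)
        Nothing  -> pure "missing"
  where
    table = [(1, "one"), (2, "two")]

main :: IO ()
main = do
    classify 1 >>= putStrLn
    classify 3 >>= putStrLn
```

This trades flexibility for safety: the END can no longer be placed before later processing inside a branch, which the manual approach above allows.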
## Measurement Overhead
Even when you are measuring an empty block of code there will be some minimum
timing and allocations reported because of the measurement overhead.
```
_ <- traceEventIO $ "START:emptyWindow"
_ <- traceEventIO $ "END:emptyWindow"
```
The timing is due to the time measurement system call itself. The allocations
are due to the execution of the `traceEventIO` Haskell code. TODO: fix the allocations.
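To get a feel for this baseline on a given machine, one option is to emit the empty window many times and read its row in the statistics. This is a minimal sketch; the repetition count of 100 is an arbitrary assumption.

```haskell
import Control.Monad (replicateM_)
import Debug.Trace (traceEventIO)

-- One empty window: whatever is reported for it is measurement overhead.
emptyWindow :: IO ()
emptyWindow = do
    traceEventIO "START:emptyWindow"
    traceEventIO "END:emptyWindow"

main :: IO ()
main =
    -- Repeat the window so the stats have several samples to summarize.
    replicateM_ 100 emptyWindow
```

The figures reported for `emptyWindow` can then serve as a rough lower bound when interpreting the numbers of other windows.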
## Measurement with Lazy Evaluation
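To measure the cost of the lookup in the code below, we need to force its
evaluation inside the window. As written, laziness defers the lookup until the
result is demanded, which may happen outside the window: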
```
m <- readIORef _configCache
return . snd $ SimpleLRU.lookup k m
```
For correct measurement use the following code:
```
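m <- readIORef _configCache
_ <- traceEventIO $ "START:" ++ "mapLookup"
-- The bang pattern (BangPatterns extension) forces the lookup here,
-- inside the window.
let !v = snd $ SimpleLRU.lookup k m
_ <- traceEventIO $ "END:" ++ "mapLookup"
return v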
```
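An alternative to the bang pattern is `evaluate` from `Control.Exception`, which forces a value to weak head normal form at a well-defined point in `IO`. The sketch below is self-contained and uses an association list in place of the real cache; `lookupMeasured` and the tag `listLookup` are hypothetical names.

```haskell
import Control.Exception (evaluate)
import Data.IORef (IORef, newIORef, readIORef)
import Debug.Trace (traceEventIO)

-- Hypothetical cache: an association list stands in for the real map type.
lookupMeasured :: IORef [(Int, String)] -> Int -> IO (Maybe String)
lookupMeasured ref k = do
    m <- readIORef ref
    traceEventIO "START:listLookup"
    -- `evaluate` forces the result to WHNF here, so the search for the key
    -- runs inside the window. The value under `Just` is not forced.
    v <- evaluate (lookup k m)
    traceEventIO "END:listLookup"
    return v

main :: IO ()
main = do
    ref <- newIORef [(1, "one"), (2, "two")]
    lookupMeasured ref 2 >>= print
```

Neither approach evaluates the value stored under the key. If building that value is part of what should be measured, it has to be forced inside the window as well.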
## Labelling Threads
We should label our threads to identify the thread to scrutinize while reading
the stats.
For example, to scrutinize the main thread:
```
import GHC.Conc (myThreadId, labelThread)

main :: IO ()
main = do
    tid <- myThreadId
    labelThread tid "main-thread"
    withTracingFlow "main" $ do
        ...
```
To scrutinize the server thread in warp we can use the following middleware:
```
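import Debug.Trace (traceEventIO)
import GHC.Conc (myThreadId, labelThread)
import Network.Wai (Application)

eventlogMiddleware :: Application -> Application
eventlogMiddleware app request respond = do
    tid <- myThreadId
    labelThread tid "server"
    traceEventIO ("START:server")
    app request respond1

    where

    -- Emit END only after the response has been sent.
    respond1 r = do
        res <- respond r
        traceEventIO ("END:server")
        return res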
```
We can use `eventlogMiddleware` as the outermost layer.
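The same labelling pattern applies to threads we fork ourselves. The following is a minimal sketch using only `base`; `forkLabelled`, the label `"worker"` and the tag `"work"` are hypothetical names chosen for illustration.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Exception (finally)
import Debug.Trace (traceEventIO)
import GHC.Conc (labelThread, myThreadId)

-- Hypothetical helper: fork a thread that labels itself before doing any
-- work. The returned MVar is filled when the thread finishes.
forkLabelled :: String -> IO () -> IO (MVar ())
forkLabelled label action = do
    done <- newEmptyMVar
    _ <- forkIO $ do
        tid <- myThreadId
        labelThread tid label
        action `finally` putMVar done ()
    return done

main :: IO ()
main = do
    done <- forkLabelled "worker" $ do
        traceEventIO "START:work"
        print (sum [1 .. 1000 :: Int])
        traceEventIO "END:work"
    -- Wait for the worker so its events are written before the program exits.
    takeMVar done
```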
## Reading the results
The program currently prints a lot of information. We are in the process of
simplifying the statistics and making the level of detail controllable via options.
Until then, it is essential to know what to ignore for a given use case.
The use case we assume here is: __understand the CPU time and the allocations of a window__.
Consider the following program:
```
{-# LANGUAGE BangPatterns #-}

import Control.Monad (unless)
import Control.Monad.IO.Class (MonadIO(..))
import Debug.Trace (traceEventIO)
import GHC.Conc (myThreadId, labelThread)

{-# INLINE withTracingFlow #-}
withTracingFlow :: MonadIO m => String -> m a -> m a
withTracingFlow tag action = do
    liftIO $ traceEventIO ("START:" ++ tag)
    !res <- action
    liftIO $ traceEventIO ("END:" ++ tag)
    pure res

{-# INLINE printSumLoop #-}
printSumLoop :: Int -> Int -> Int -> IO ()
printSumLoop _ _ 0 = print "All Done!"
printSumLoop chunksOf from times = do
    withTracingFlow "sum" $ print $ sum [from..(from + chunksOf)]
    printSumLoop chunksOf (from + chunksOf) (times - 1)

main :: IO ()
main = do
    tid <- myThreadId
    labelThread tid "main-thread"
    withTracingFlow "main" $ do
        printSumLoop 10000 1 100
```
The statistics gleaned from the eventlog of the above program will look like the
following:
```
--------------------------------------------------
Summary Stats
--------------------------------------------------

Global thread wise stat summary
tid label       samples ThreadCPUTime ThreadAllocated
--- ----------- ------- ------------- ---------------
1   main-thread 2       967,479       434,384
2   -           1       5,854         17,664
-   -           3       973,333       452,048

Window [1:main] thread wise stat summary
ProcessCPUTime: 1,174,455
ProcessUserCPUTime: 0
ProcessSystemCPUTime: 1,175,000
ThreadCPUTime:934,898
GcCPUTime:0
RtsCPUTime:239,557
tid label       samples ThreadCPUTime ThreadAllocated
--- ----------- ------- ------------- ---------------
1   main-thread 1       934,898       429,952
-   -           1       934,898       429,952

Window [1:sum] thread wise stat summary
ProcessCPUTime: 953,862
ProcessUserCPUTime: 0
ProcessSystemCPUTime: 949,000
ThreadCPUTime:833,991
GcCPUTime:0
RtsCPUTime:119,871
tid label       samples ThreadCPUTime ThreadAllocated
--- ----------- ------- ------------- ---------------
1   main-thread 100     833,991       328,224
-   -           100     833,991       328,224

--------------------------------------------------
Detailed Stats
--------------------------------------------------

Window [1:main] thread wise stats for [ThreadCPUTime]
tid label       total   count avg     minimum maximum stddev
--- ----------- ------- ----- ------- ------- ------- ------
1   main-thread 934,898 1     934,898 934,898 934,898 0
Grand total: 934,898

Window [1:main] thread wise stats for [ThreadAllocated]
tid label       total   count avg     minimum maximum stddev
--- ----------- ------- ----- ------- ------- ------- ------
1   main-thread 429,952 1     429,952 429,952 429,952 0
Grand total: 429,952

Window [1:sum] thread wise stats for [ThreadCPUTime]
tid label       total   count avg   minimum maximum stddev
--- ----------- ------- ----- ----- ------- ------- ------
1   main-thread 833,991 100   8,340 5,533   63,493  5,714
Grand total: 833,991

Window [1:sum] thread wise stats for [ThreadAllocated]
tid label       total   count avg   minimum maximum stddev
--- ----------- ------- ----- ----- ------- ------- ------
1   main-thread 328,224 100   3,282 2,960   31,584  2,844
Grand total: 328,224

Global thread wise stats for [ThreadCPUTime]
tid label       total   count avg     minimum maximum stddev
--- ----------- ------- ----- ------- ------- ------- -------
1   main-thread 967,479 2     483,740 33,519  933,960 450,220
2   -           5,854   1     5,854   5,854   5,854   0
Grand total: 973,333

Global thread wise stats for [ThreadAllocated]
tid label       total   count avg     minimum maximum stddev
--- ----------- ------- ----- ------- ------- ------- -------
1   main-thread 434,384 2     217,192 4,920   429,464 212,272
2   -           17,664  1     17,664  17,664  17,664  0
Grand total: 452,048
```
From the __Global thread wise stat summary__ under __Summary Stats__ figure out
the thread id we want to scrutinize. In this case, we care about the
`main-thread`. The thread id is `1`.
We can then skip to the __Detailed Stats__ section and look at the windows that
run in `main-thread`. The windows in the above program are `main` and `sum`.
Each window name is prefixed with the thread id, so the relevant sections are
those for `[1:main]` and `[1:sum]`.
That is,
```
Window [1:main] thread wise stats for [ThreadCPUTime]
tid label       total   count avg     minimum maximum stddev
--- ----------- ------- ----- ------- ------- ------- ------
1   main-thread 934,898 1     934,898 934,898 934,898 0
Grand total: 934,898

Window [1:main] thread wise stats for [ThreadAllocated]
tid label       total   count avg     minimum maximum stddev
--- ----------- ------- ----- ------- ------- ------- ------
1   main-thread 429,952 1     429,952 429,952 429,952 0
Grand total: 429,952

Window [1:sum] thread wise stats for [ThreadCPUTime]
tid label       total   count avg   minimum maximum stddev
--- ----------- ------- ----- ----- ------- ------- ------
1   main-thread 833,991 100   8,340 5,533   63,493  5,714
Grand total: 833,991

Window [1:sum] thread wise stats for [ThreadAllocated]
tid label       total   count avg   minimum maximum stddev
--- ----------- ------- ----- ----- ------- ------- ------
1   main-thread 328,224 100   3,282 2,960   31,584  2,844
```
Consider one specific section:
```
Window [1:sum] thread wise stats for [ThreadCPUTime]
tid label       total   count avg   minimum maximum stddev
--- ----------- ------- ----- ----- ------- ------- ------
1   main-thread 833,991 100   8,340 5,533   63,493  5,714
```
This section is a table with 8 columns and possibly multiple rows. We should
only scrutinize the row whose `tid` corresponds to `main-thread`, i.e. `tid == 1`.
`ThreadCPUTime` is reported in nanoseconds and `ThreadAllocated` in bytes.
Columns:
- `tid`: The thread id
- `label`: The thread label
- `total`: The total accumulated sum of all the samples
- `count`: The number of samples, i.e. the number of times this window was seen
- `avg`: The average of the samples
- `minimum`: The minimum of all the samples
- `maximum`: The maximum of all the samples
- `stddev`: The standard deviation of the samples
__NOTE__: It is important to look at `stddev`. If `stddev` is more than 30% of
`avg` and the gap between `minimum` and `maximum` is large, `avg` is probably
skewed by outliers. In the future we would like to remove outliers automatically.
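The 30% rule of thumb can be checked mechanically. This is a minimal sketch using the `avg` and `stddev` values from the `[1:sum]` and `[1:main]` rows shown earlier; `isNoisy` is a hypothetical helper, not something the tool provides.

```haskell
-- Hypothetical helper: True when stddev exceeds the given fraction of avg.
isNoisy :: Double -> Double -> Double -> Bool
isNoisy threshold avg stddev = avg > 0 && stddev / avg > threshold

main :: IO ()
main = do
    -- [1:sum] ThreadCPUTime: avg 8,340, stddev 5,714 (ratio about 0.69).
    print (isNoisy 0.3 8340 5714)
    -- [1:sum] ThreadAllocated: avg 3,282, stddev 2,844 (ratio about 0.87).
    print (isNoisy 0.3 3282 2844)
    -- [1:main] ThreadCPUTime: a single sample, so stddev is 0.
    print (isNoisy 0.3 934898 0)
```

Both `sum` rows exceed the threshold, and their `maximum` is roughly ten times their `minimum`, so their averages should be read with caution.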