{"id":35260668,"url":"https://github.com/raczben/tco_study","last_synced_at":"2026-04-01T20:22:36.968Z","repository":{"id":106208152,"uuid":"201288698","full_name":"raczben/tco_study","owner":"raczben","description":"Case study of synchronous FPGA signaling by adjusting the output timing","archived":false,"fork":false,"pushed_at":"2019-08-16T12:31:06.000Z","size":409,"stargazers_count":11,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2026-03-28T00:39:35.165Z","etag":null,"topics":["clock-to-output","constraint","fpga","synchronous","tco","timing","ultrascale","vivado","xilinx"],"latest_commit_sha":null,"homepage":null,"language":"Tcl","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/raczben.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2019-08-08T15:43:12.000Z","updated_at":"2024-07-24T13:23:23.000Z","dependencies_parsed_at":"2023-05-30T14:30:37.835Z","dependency_job_id":null,"html_url":"https://github.com/raczben/tco_study","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/raczben/tco_study","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raczben%2Ftco_study","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raczben%2Ftco_study/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raczben%2Ftco_study/releases","manifests_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raczben%2Ftco_study/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/raczben","download_url":"https://codeload.github.com/raczben/tco_study/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raczben%2Ftco_study/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31291534,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-01T13:12:26.723Z","status":"ssl_error","status_checked_at":"2026-04-01T13:12:25.102Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clock-to-output","constraint","fpga","synchronous","tco","timing","ultrascale","vivado","xilinx"],"created_at":"2025-12-30T09:03:46.504Z","updated_at":"2026-04-01T20:22:36.962Z","avatar_url":"https://github.com/raczben.png","language":"Tcl","funding_links":[],"categories":[],"sub_categories":[],"readme":"# tco_study\nCase study of synchronous FPGA signaling by adjusting the output timing\n\nThis is a case study of synchronous FPGA signaling by adjusting the t_co (clock-to-output) timing. 
This\nstudy uses Xilinx's UltraScale architecture (more precisely the xcku040-ffva1156-2-i device);\nhowever, the methodology is general and can be applied to any FPGA family.\n\n# The problem\nToday's protocols are mostly *self-synchronous* and don't need globally synchronous behavior.\nHowever, in some cases global synchrony cannot be avoided. This study shows how it can be achieved with\nFPGAs even in hard timing cases.\n\nLet's assume that we want to build a [DAQ][1] (Data-acquisition) unit, which requires precise\ntrigger timing. All modules need the trigger signal at the same time. (We need to assume that\nall modules use the same clock with a given uncertainty.)\n\n# This repository\nThis repository contains two Vivado projects. (More precisely, project creator Tcl files.) The first\nproject is located in the [singlecycle](single-cycle) directory. This project demonstrates three\nsimple attempts to meet the timing, but the requirements are too challenging to fulfill, so all\noutput timings fail.\n\nThe second project is located in the [multicycle](multicycle) directory. This project demonstrates\nhow to meet the timing using a *multicycle path* constraint for the output ports. These ideas are\nsuccessful: the design meets the timing analyzer's requirements.\n\n## Build\n\nTo build the projects, open Vivado (a version which supports Kintex UltraScale devices) and enter the\n[singlecycle](single-cycle) or [multicycle](multicycle) directory. Then *source* the project creator\nfile: `source create_mc_project.tcl`\n\n![Create the project](doc_resources/open_project.jpg)\n\nThen generate the bitstream.\n\n![Generate the bitstream](doc_resources/generate_bitstream.png)\n\nTo see the timing details, click *Open Implemented Design*.\n\n\n# Details\nThe following sections walk you through from the very basic (but failing) implementations to three\nsuccessful solutions.\n\n## Timing requirements\n\nThis section is optional. 
You can skip to the next section; you only need to accept the minimum\n`odelay_m = 3.0` and the maximum `odelay_M = 8.0` output delays.\n\nAltera has a quite good [cookbook][2] about timing issues. Xilinx's [UltraFast design\nmethodology][3] can also help to calculate timing. The following picture is from the Altera cookbook\n(Chip-to-Chip Design with Virtual Clocks as Input/Output Ports):\n\n![Timing overview of synchronous devices](doc_resources/altera_timing_blockdiagram_small.png)\n\nThis study only deals with the *B* side, where the FPGA is the signal driver.\n\nHere are the output timing constraints with arbitrarily chosen values for the delays.\n(The `*_m` suffix denotes the minimum, the `*_M` suffix denotes the maximum value.)\n\n```tcl\n# create a 100MHz clock\ncreate_clock -period 10.000 [get_ports i_clk_p]\n\n# create the associated virtual clock\ncreate_clock -name clkB_virt -period 10\n\n# minimum and maximum external clock delay from the global oscillator towards the FPGA\nset CLK_fpga_m 3.5\nset CLK_fpga_M 4\n# minimum and maximum external clock delay from the global oscillator towards the DAQ module\nset CLK_daq_m 5\nset CLK_daq_M 6.5\n# setup time and hold time of the DAQ module\nset tSUb 2\nset tHb 0.5\n# minimum and maximum board delay from the FPGA to the DAQ module (on trigger)\nset BD_trigger_m 6.5\nset BD_trigger_M 7.0\n\n# odelay_M = 4 + 2 + 7.0 - 5 = 8.0\n# odelay_m = 3.5 - 0.5 + 6.5 - 6.5 = 3.0\nset odelay_M [expr {$CLK_fpga_M + $tSUb + $BD_trigger_M - $CLK_daq_m}]\nset odelay_m [expr {$CLK_fpga_m - $tHb + $BD_trigger_m - $CLK_daq_M}]\n\n# create the output maximum delay for the data output from the\n# FPGA that accounts for all delays specified (odelay_M = 8.0)\nset_output_delay -clock clkB_virt -max $odelay_M [get_ports {\u003cout_ports\u003e}]\n# create the output minimum delay for the data output from the\n# FPGA that accounts for all delays specified (odelay_m = 3.0)\nset_output_delay -clock clkB_virt -min $odelay_m [get_ports {\u003cout_ports\u003e}]\n```\n\nSo the 
final numbers for this study are `odelay_M = 8.0` and `odelay_m = 3.0`.\n\n---\n\n## Single-cycle implementations (fail)\n\nFirst, let's look at some simple approaches that don't need deep FPGA knowledge. We will see, however,\nthat these implementations cannot fulfill these challenging timing requirements, so the next chapter\nturns to a multi-cycle constraint.\n\nIn this chapter all outputs have the following output delay constraints (see the previous chapter for details):\n\n```tcl\n# create the output maximum delay for the data output from the\n# FPGA that accounts for all delays specified (odelay_M = 8.0)\nset_output_delay -clock clkB_virt -max $odelay_M [get_ports {\u003cout_ports\u003e}]\n# create the output minimum delay for the data output from the\n# FPGA that accounts for all delays specified (odelay_m = 3.0)\nset_output_delay -clock clkB_virt -min $odelay_m [get_ports {\u003cout_ports\u003e}]\n```\n\n### First (native) implementation\n\nThe `o_native_p` (/n) ports of the [singlecycle](single-cycle) design demonstrate the simplest version.\nSimple means a native, fabric flip-flop output connected to the output buffer.\n\n```vhdl\n-- Native\ninst_native_obufds : OBUFDS\ngeneric map(\n  IOSTANDARD =\u003e \"LVDS\"\n)\nport map(\n  O  =\u003e o_native_p,\n  OB =\u003e o_native_n,\n  I  =\u003e q_native_d2\n);\n```\n\nThis implementation fails the timing. The timing analyzer reports negative setup slack\nagainst the virtual `clkB_virt` clock:\n\n| Port name   | setup slack | hold slack |\n|-------------|-------------|------------|\n| o_native_p  | -4.421      | 5.777      |\n\nThe negative setup slack means our signal is too slow. Let's try to make it faster!\n\n### Place into IOB\n\nFPGAs typically have a dedicated, fast output flip-flop, which is placed next to the output buffer. 
The\n`o_iob_p` (/n) ports of the [singlecycle](single-cycle) project demonstrate this solution.\n\nOn Xilinx FPGAs the IOB property tells the compiler to place the given flip-flop in the dedicated,\nfast output register. This property can be set as follows:\n`set_property IOB TRUE [get_cells \u003cregister_name\u003e]`\n\nAlthough this brings the slack a bit closer to zero, it still fails the timing.\n\n| Port name | setup slack | hold slack |\n|-----------|-------------|------------|\n| o_iob_p   | -3.821      | 5.586      |\n\n\n### Dedicated DDR flip-flop\n\nAnother dedicated flip-flop is located in the IO in modern FPGAs. This is the DDR flip-flop. This\napproach is implemented by the `o_ddr_p` (/n) output ports. An `ODDRE1` device primitive needs to be\nplaced in order to drive DDR data:\n\n```vhdl\nODDRE1_inst : ODDRE1\ngeneric map (\n  IS_C_INVERTED =\u003e '0',  -- Optional inversion for C\n  SRVAL =\u003e '0'           -- Initializes the ODDRE1 Flip-Flops to the specified value ('0', '1')\n)\nport map (\n  Q =\u003e w_ddr,   -- 1-bit output: Data output to IOB\n  C =\u003e w_clk,   -- 1-bit input: High-speed clock input\n  D1 =\u003e q_ddr_d2, -- 1-bit input: Parallel data input 1\n  D2 =\u003e q_ddr_d2, -- 1-bit input: Parallel data input 2\n  SR =\u003e '0'     -- 1-bit input: Active High Async Reset\n);\n```\n\nNote that to reach the same timing behavior we need to modify the output delay constraint. The\nmaximum delay should be reduced by half the period of the system clock (i.e. 5ns):\n\n```tcl\nset_output_delay -clock clkB_virt -max [expr {$odelay_M - 5}] [get_ports {o_ddr*}]\n```\n\nIn spite of the efforts the timing fails; what's more, this method has the worst results:\n\n| Port name | setup slack | hold slack |\n|-----------|-------------|------------|\n| o_ddr_p   | -4.616      | 5.907      |\n\n### Summary of single-cycle\n\nThis FPGA is not fast enough to fulfill these timing requirements. 
The following figures show all\nthe setup/hold slacks:\n\nThe setup slacks:\n\n![Single-cycle setup slacks](doc_resources/sc_setup_slacks.png)\n\nThe hold slacks:\n\n![Single-cycle hold slacks](doc_resources/sc_hold_slacks.png)\n\n---\n\n## Multicycle solutions\n\nTo understand the root cause of the failed timings we need to look under the hood at\nthe timing details. By default (single-cycle) the timing analyzer expects all data at the next clock\nedge after the launch edge. The following waveform shows the *required data valid window*\non the FPGA pad. The data must be valid throughout this window. (The signal may become valid\nearlier or stay valid after this window, but during this span of time the data *must* be valid.)\n\n![Single-cycle requirement](doc_resources/sc_requirement.svg)\n\n(The destination clock uncertainty and any other delays must be added/subtracted to/from odelay_M/m\nto get the accurate valid window, but here these are negligible.)\n\nLet's see one particular case. (There is no essential difference between the previously demonstrated\nfailing implementations, so let's choose the *iob* type implementation.)\n\n![valid_and_fpga_sc](doc_resources/valid_and_fpga_sc.svg)\n\nThis default (single-cycle) mode requires faster behavior, which cannot be fulfilled by this FPGA.\nHowever, the *required valid window* is shorter than the guaranteed, real valid data window.\n\nThe length of the *required valid window* is `req_len = odelay_M - odelay_m = 8 - 3 = 5`\n\nThe length of the real valid data window is `req_len + setup_slack + hold_slack = 5 - 3.8 + 5.6 = 6.8`\n\nSo if these windows can be shifted, the timing could be closed.\n\nIn most system-synchronous cases an additional fixed and known delay is acceptable. Let's shift the\n*required data valid window* by a whole clock cycle. 
This delay of one (or more) clock cycles is called a *multicycle path*.\n\n![Multi-cycle requirement](doc_resources/mc_requirement.svg)\n\nIn this case the FPGA doesn't need to be as fast as in the single-cycle mode, but now it has to be\nmore accurate to cover the whole *required valid window*. What's more, the harder part is not\nto violate the hold time requirements, in other words, to hold the data until the end of the *required data\nvalid window*. So we can say that the FPGA has to be \"as slow as possible\".\n\nTo set the multi-cycle path only the following constraint is needed:\n\n```tcl\n# Set multicycle path for all outputs\nset_multicycle_path -to [get_ports o_*] 2\n```\n\nThe following chapters show different implementations that can solve this issue. To see more\ndetails, open the project from the [multicycle](multicycle) directory.\n\n\n### Native multicycle implementation (fails)\n\nWe have seen that the compiler cannot route as fast as required, but maybe it can solve this\nmulti-cycle path problem. So let's just implement a simple register, and connect it to the output port with\nthe multi-cycle constraint. This idea is implemented by the `o_native_mc_p` (/n) ports.\n\nAfter a longer compilation the timing fails in this case too.\n\n| Port name | setup slack | hold slack |\n|-----------|-------------|------------|\n| o_iob_p   | -3.555      | 0.579      |\n\nWhat happened? The compiler tried to use general routing resources to add delay to match the\nrequired data valid window. A huge routing time can be seen in the FPGA device view. Turn on the\n*Routing resources* option ![Routing resources](doc_resources/routing_resources.png) and see the\nrouting snake:\n\n![native_mc_route_in_device_overview](doc_resources/native_mc_route_in_device_overview.png) \n![native_mc_route_in_device_zoom](doc_resources/native_mc_route_in_device_zoom.png)\n\nThe detailed timing report of this failing path is also strange. 
Here is the setup report, with a routing time of more\nthan 9ns!\n\n![native_mc_setup_report](doc_resources/native_mc_setup_report.png)\n\nBut the same routing time in the hold report (which uses the fast model of the FPGA) is less than 5ns:\n\n![native_mc_hold_report](doc_resources/native_mc_hold_report.png)\n\nSo the problem is that the FPGA's routing resources have greater uncertainty than the constraints\nallow. Note that with simpler timing requirements you can stop here, because the router will add a\nproper delay. But here we have to investigate further. Let's try to use the dedicated delay elements,\ncalled ODELAY.\n\n\n### Using dedicated delay primitive\n\nLet's try to replace the routing delays with dedicated output delays. This approach is implemented\nby the `o_odelay_p` (/n) ports of the [multicycle](multicycle) project. We need to replace the\nrouting delay of the previous (failed) solution. This was 9.4ns, with -2.4 setup slack. So we need\nto delay ~7ns.\n\nUltraScale's `ODELAYE3` primitive can delay up to 1.25ns in fixed mode. So a cascaded delay\nstructure is needed to delay ~7ns. Note also that cascading adds extra routing delay,\nso let's try three cascaded `ODELAYE3` primitives. The cascade instantiation is described in the\n[UltraScale SelectIO][4] user guide.\n\nWow! This is a working solution. The timing meets the requirements:\n\n| Port name | setup slack | hold slack |\n|-----------|-------------|------------|\n| o_odelay_p| 0.064       | 0.173      |\n\nHowever, both setup and hold slacks are tiny. What happened to our great valid window? Let's\nlook again at the detailed timing reports (the data path delays only). 
\n\nSlow model (for setup calculations):\n\n![mc_odelay_setup_timing](doc_resources/mc_odelay_setup_timing.png)\n\nFast model (for hold calculations):\n\n![mc_odelay_hold_timing](doc_resources/mc_odelay_hold_timing.png)\n\nThese numbers show the same effect as the first multi-cycle implementation: the\nFPGA's uncertainty narrows the real valid window. There is a big difference between the data delays\nof the slow (11.9) and fast (7.198) models. This time the unwanted effect is weak enough that the timing\ncould be closed, unlike with the native implementation.\n\nThere are two disadvantages of this technique:\n\n - The cascaded delays have relatively large uncertainty, so they cannot fulfill more challenging\n constraints.\n - The technique needs a large number of delay elements. An arbitrary number of outputs\n cannot be delayed, because the FPGA has a limited number of delay elements.\n\nThe next two chapters show more sophisticated solutions.\n\n\n### Using phase shifted clock\n\nThe `o_iob_shifted_clk_p` (/n) ports of the [multicycle](multicycle) project meet the timing by\nadjusting the clock of the last flip-flop.\n\n![shifted_clock_mc](doc_resources/shifted_clock_mc.svg)\n\nThis technique effectively adds extra delay to the clock path towards the FPGA (the `CLK_fpga_m` (/M) in\nthe constraint file). If the value of the `clock_shift` above equals the previously approximated\n~7ns, the value of the `tco` will be a simple output delay. The ~7ns of the `clock_shift` has to be\nconverted to phase for [Xilinx's Clocking Wizard][5]. 
`7ns/10ns*360deg = 252deg`. The\n[multicycle](multicycle) project uses `240deg (6.67ns)` as the phase, which gives better results.\n\n![mc_shifted_clock_wizz_settings](doc_resources/mc_shifted_clock_wizz_settings.png)\n\nThe timing constraints are met again, with better results than the odelay one:\n\n| Port name            | setup slack | hold slack |\n|----------------------|-------------|------------|\n| o_iob_shifted_clk_p  | 0.850       | 1.290      |\n\nWhat great slacks! Both setup and hold are above half a nanosecond.\n\nTwo notes for this technique:\n\n - The data has to be transferred from the `system_clk`\nto this new `shifted_clock`, which requires one flip-flop (or more, to help internal timing). The timing\nrequirements of this internal path (from `system_clk` to `shifted_clock`) are generated automatically, because a\nclock generator is used.\n - A couple of recompilations with adjusted phase values may be needed to get better output\ntimings. At first we might think that if the setup slack is greater than the hold slack, more phase shift is\nneeded, and vice versa. But this is misleading, because the router can add extra internal delay (as in\nthe native implementation), which can lead us the wrong way.\n\nAlthough this technique can achieve the best timing results, the FPGA will run out of clocking\nresources if a large number of outputs must be adjusted with different requirements.\n\n\n### Using inverted clock and delay primitive\n\nThe last presented method combines the previous two. For the implementation see the\n`o_odelay_nclk_p` (/n) ports of the [multicycle](multicycle) project. Here both a clock phase shift\nand a delay element are used. The phase shift is special: the output flip-flop is driven by the inverted\nsystem clock. The clock inversion means a 50% phase shift, which is 5ns in our case. As we have seen,\na total delay of ~7ns is needed in the multicycle implementation (in this particular case). 
Now the clock\ninversion provides 5ns, so ~2ns of additional delay is needed, which is added using the `ODELAYE3`\ndevice primitive. This technique also meets the timing requirements:\n\n| Port name            | setup slack | hold slack |\n|----------------------|-------------|------------|\n| o_odelay_nclk_p      | 0.687       | 1.212      |\n\nIn general, with a shifted clock and one delay element primitive a huge number of synchronous output\nsignals can be handled. The clock should be shifted according to the port with the fastest\nrequirements (the greatest odelay_M), while the fixed value of the delays can be adjusted port by port.\n\nThe clock inversion has another advantage: it does not require a PLL/MMCM module. The clock\nbuffer itself can invert the clock.\n\n\n### Summary of multicycle solutions\n\nWe have seen three successful implementations for these challenging output requirements.\n\n| Port name             | setup slack | hold slack |\n|-----------------------|-------------|------------|\n| o_iob_p  (fail)       | -3.555      | 0.579      |\n| o_odelay_p            | 0.064       | 0.173      |\n| o_iob_shifted_clk_p   | 0.850       | 1.290      |\n| o_odelay_nclk_p       | 0.687       | 1.212      |\n\n\n## Summary\n\nI hope that you won't encounter such challenging timings, but now you can see that there is life\nafter death... 
\n\nClone this repository, set your target device, modify the constraint files according to your\nrequirements, and try to close the timings.\n\n[1]: https://en.wikipedia.org/wiki/Data_acquisition\n[2]: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/manual/mnl_timequest_cookbook.pdf\n[3]: https://www.xilinx.com/support/documentation/sw_manuals/xilinx2019_1/ug949-vivado-design-methodology.pdf\n[4]: https://www.xilinx.com/support/documentation/user_guides/ug571-ultrascale-selectio.pdf\n[5]: https://www.xilinx.com/support/documentation/ip_documentation/clk_wiz/v6_0/pg065-clk-wiz.pdf","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraczben%2Ftco_study","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fraczben%2Ftco_study","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraczben%2Ftco_study/lists"}