{"id":19712437,"url":"https://github.com/springernature/fs2-pdf","last_synced_at":"2025-04-29T18:30:54.762Z","repository":{"id":57727200,"uuid":"238472383","full_name":"springernature/fs2-pdf","owner":"springernature","description":"Streaming PDF processor for Scala","archived":true,"fork":false,"pushed_at":"2025-04-02T11:36:34.000Z","size":275,"stargazers_count":13,"open_issues_count":0,"forks_count":1,"subscribers_count":52,"default_branch":"main","last_synced_at":"2025-04-02T12:32:10.589Z","etag":null,"topics":["fs2","pdf","scala","scodec","stream"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/springernature.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-02-05T14:40:58.000Z","updated_at":"2025-04-02T11:41:06.000Z","dependencies_parsed_at":"2025-04-02T12:36:30.567Z","dependency_job_id":null,"html_url":"https://github.com/springernature/fs2-pdf","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/springernature%2Ffs2-pdf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/springernature%2Ffs2-pdf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/springernature%2Ffs2-pdf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/springernature%2Ffs2-pdf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/springernature","download_url":"h
ttps://codeload.github.com/springernature/fs2-pdf/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251559772,"owners_count":21609072,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fs2","pdf","scala","scodec","stream"],"created_at":"2024-11-11T22:17:08.713Z","updated_at":"2025-04-29T18:30:54.755Z","avatar_url":"https://github.com/springernature.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# :warning: Repository Status: No Longer Maintained\n\nThank you for your interest in **fs2-pdf**!\n\nUnfortunately, this project is no longer actively maintained. 
While the code will remain available, no further updates, bug fixes, or support will be provided.\n\n# :question: What Does This Mean?\n\n* The repository will be **archived**, making it **read-only**.\n* You are still welcome to fork the repository and use the code as needed.\n* No new issues, pull requests, or discussions will be accepted.\n\n# :raised_hands: Thank You!\nWe appreciate everyone who contributed, used, and supported this project.\n\n## About\n\n**fs2-pdf** is a Scala library for manipulating PDF files in [fs2] streams using [scodec] for parsing.\n\nPrevalent PDF manipulation tools like iText require the whole file to be read into memory, making it hard to estimate\nthe memory footprint due to large images and imposing a hard boundary for the document file size of `Int.MaxValue`.\n\n## Module ID\n\n```sbt\n\"com.springernature\" %% \"fs2-pdf\" %  \"0.1.0-RC8\"\n```\n\n# Usage\n\nThe provided `fs2` pipes convert pdf data in the shape of a `Stream[IO, Byte]` into data type encodings and back to byte\nstreams.\nRaw data is processed with [scodec] and stored as `BitVector`s and `ByteVector`s.\n\n```scala\nimport fs2.pdf._\n```\n\n## Decoding\n\nThe coarsest useful data types are provided by `wm.pdf.PdfStream.topLevel`, producing the ADT `TopLevel`:\n\n```scala\ncase class IndirectObj(obj: pdf.IndirectObj)\ncase class Version(version: pdf.Version)\ncase class Comment(data: pdf.Comment)\ncase class Xref(version: pdf.Xref)\ncase class StartXref(startxref: pdf.StartXref)\n```\n\nPDFs consist of a series of objects that look like this:\n\n```pdf\n% a compressed content stream object\n3 0 obj\n\u003c\u003c/Filter /FlateDecode /Length 19\u003e\u003e\nstream\n\u003cbinary data...\u003e\nendstream\nendobj\n\n% an array object\n112 0 obj\n[/Name 5 4 0 R (string)]\nendobj\n\n% a dict object\n2624 0 obj\n\u003c\u003c/Count 48 /Kids [2625 0 R 2626 0 R 2627 0 R 2628 0 R 2629 0 R 2630 0 R] /Parent 2623 0 R /Type /Pages\u003e\u003e\nendobj\n```\n\nThese are encoded as 
`IndirectObj`, with the stream being optional.\nThere are special object streams, in which a stream contains more objects; those are decoded in a later stage.\n\nAt the very beginning of a document, the version header should appear:\n\n```pdf\n%PDF-1.7\n%Ã¢Ã£ÃÃ\n\n```\n\nThe second line is optional and indicates that the PDF contains binary data streams.\n\nAt the end of a document, the cross reference table, or xref, indicates the byte offsets of the contained objects,\nlooking like this:\n\n```pdf\nxref\n0 1724\n0000000000 65535 f \n0000111287 00000 n \n0000111518 00000 n \n0000111722 00000 n \n0000111822 00000 n \n0000112053 00000 n \n...\n0000111175 00000 n \ntrailer\n\u003c\u003c\n/ID [\u003c9154668ac56ee69570067970a0db0b0a\u003e \u003c3ddbb5faba07f5306b8feb50afd4225c\u003e ]\n/Root 1685 0 R\n/Size 1724\n/Info 1683 0 R\n\u003e\u003e\nstartxref\n1493726\n%%EOF\n\n```\n\nThe dictionary after the `trailer` keyword contains metadata, in particular the `/Root` reference pointing to the object\ndescribing the pages.\n\nThe number after the `startxref` keyword denotes the byte offset of the `xref` keyword for quicker seeking in viewer apps.\n\nMultiple xrefs may occur in a document under two conditions:\n* linearized PDFs, where an additional xref at the beginning of the document references only the first page, for\n  optimized loading\n* incrementally updated PDFs, allowing authoring tools to append arbitrarily many additional objects with xrefs\n\nXrefs can be compressed into binary streams.\nIn that case, only the part starting with `startxref` will be at the end of the file, and the `StartXref` variant will\nencode this part.\n\nIn order to use this initial encoding, pipe a stream of bytes through `PdfStream.topLevel`:\n\n```scala\nval raw: Stream[IO, Byte] =\n  openAPdfFile\n\nval topLevel: Stream[IO, TopLevel] =\n  raw.through(PdfStream.topLevel)\n```\n\nFor a slightly more abstract encoding, `TopLevel` can be transformed into `Decoded` with the 
variants:\n\n```scala\ncase class DataObj(obj: Obj)\ncase class ContentObj(obj: Obj, rawStream: BitVector, stream: Uncompressed)\ncase class Meta(xrefs: NonEmptyList[Xref], trailer: Trailer, version: Version)\n```\n\nHere, the distinction between objects with and without streams is made and the metadata is aggregated into a unique\nrecord containing all xrefs, the aggregated trailer and the version.\n\nTo use this:\n\n```scala\nval decoded: Stream[IO, Decoded] =\n  raw.through(PdfStream.decode(Log.noop)) // you can provide a real logger with `fs2.pdf.Log.io`\n```\n\nFor another level of abstraction, the `Element` algebra represents semantics of objects:\n\n```scala\nobject DataKind\n{\n  case object General\n  case class Page(page: pdf.Page)\n  case class Pages(pages: pdf.Pages)\n  case class Array(data: Prim.Array)\n  case class FontResource(res: pdf.FontResource)\n}\n\ncase class Data(obj: Obj, kind: DataKind)\n\nobject ContentKind\n{\n  case object General\n  case class Image(image: pdf.Image)\n}\n\ncase class Content(obj: Obj, rawStream: BitVector, stream: Uncompressed, kind: ContentKind)\n\ncase class Meta(trailer: Trailer, version: Version)\n```\n\nTo use this:\n\n```scala\nval elements: Stream[IO, Element] =\n  raw.through(PdfStream.elements(Log.noop))\n```\n\n## Encoding\n\nA stream of indirect objects can be encoded into a pdf document, with automatic generation of the cross reference table.\n\n```scala\nval reencoded: Stream[IO, Byte] =\n  decoded\n    .through(Decoded.parts)\n    .through(WritePdf.parts)\n    .through(Write.bytes(\"/path/to/file.pdf\"))\n```\n\nor\n\n```scala\nval reencoded: Stream[IO, Byte] =\n  elements\n    .through(Element.parts)\n    .through(WritePdf.parts)\n    .through(Write.bytes(\"/path/to/file.pdf\"))\n```\n\nThe intermediate data type `Part` is used to carry over the trailer into the encoder.\nInstead of `Decoded.parts` and `WritePdf.parts`, you could also use `Decoded.objects` and `WritePdf.objects`, but then you would\nhave 
to specify a trailer dictionary for the encoder.\n`Decoded.parts` extracts this information from the input trailer.\n\n## Transforming\n\nSince you won't just want to reencode the original data, a step in between decoding and encoding should manipulate it.\nThe pipes in `Rewrite` are used in `Decoded.parts`, and they allow more complex transformations by keeping a state when\nanalyzing objects and using it to create additions to the document.\n\nThe rewrite works in two stages, `collect` and `update`:\n\n```scala\ncase class PagesState(pages: List[Pages])\nval initialState: PagesState = PagesState(Nil)\nval result: Stream[IO, ByteVector] =\n  elements\n    .through(Rewrite(initialState)(collect)(update))\n```\n\nIn `collect`, every `Element` (or `Decoded`) will be evaluated by a stateful function that allows you to prevent objects from\nbeing written and instead collect them for an update:\n\n```scala\ndef collect(state: RewriteState[PagesState]): Element =\u003e Pull[IO, Part[Trailer], RewriteState[PagesState]] = {\n  case Element.Data(_, Element.DataKind.Pages(pages)) =\u003e\n    Pull.pure(state.copy(state = state.state.copy(pages = pages :: state.state.pages)))\n  case Element.obj(obj) =\u003e\n    Pull.output1(Part.Obj(obj)) \u003e\u003e Pull.pure(state)\n  case Element.Meta(trailer, _) =\u003e\n    Pull.pure(state.copy(trailer = Some(trailer)))\n}\n```\n\nHere we get our state, wrapped in `RewriteState`, which tracks the trailer, and an `Element` passed into our function.\nThe output is `fs2.Pull`, on which you can call `Pull.output1` to instruct the stream to emit a PDF part, and `Pull.pure` to\nreturn the updated state.\n\nThere is a variant `Rewrite.simple` that hides the `Pull` from these signatures and instead expects you to return\n`(List[Part[Trailer]], RewriteState[PagesState])`.\n\nIn this example, we match on `DataKind.Pages` and do not call `Pull.output1` in this case, but add the pages to our state.\nIn the case of any other object, which we match 
with the convenience extractor `Element.obj`, we just pass through as a\n`Part.Obj`.\nFinally, the trailer has to be carried over in the state.\n\nIn `update`, we use the collected pages to do some analysis and then write them back to the stream:\n\n```scala\n// placeholder for a custom font object, not implemented here\ndef fontObj(number: Long): IndirectObj =\n  ???\n\ndef update(update: RewriteUpdate[PagesState]): Pull[IO, Part[Trailer], Unit] =\n  Pull.output1(Part.Trailer(update.trailer)) \u003e\u003e update.state.pages.traverse_ {\n    case Pages(index, data, _, true) =\u003e\n      val updatedData = data ++ Prim.dict(\"Resources\" -\u003e Prim.dict(\"Font\" -\u003e Prim.dict(\"F1\" -\u003e Prim.Ref(1000L, 0))))\n      Pull.output1(Part.Obj(fontObj(1000L))) \u003e\u003e\n      Pull.output1(Part.Obj(IndirectObj(Obj(index, updatedData), None)))\n    case Pages(index, data, _, false) =\u003e\n      Pull.output1(Part.Obj(IndirectObj(Obj(index, data), None)))\n  }\n```\n\n`RewriteUpdate` is the same as `RewriteState`, except that the trailer isn't optional anymore.\nIf there is no trailer in the state at the end of the stream, an error is raised.\n\nIn this function, we first emit the trailer, then we iterate over our collected pages and match on the boolean `root`\nfield.\nIf we find the root page tree object, we first emit a custom font object (not implemented here), then add\na reference to it to the page root's `Resources` dictionary (this is of course an incomplete simplification).\nFor all other `Pages` objects, we just write the original data.\n\n## Validation\n\n```scala\nval result: IO[ValidatedNel[String, Unit]] = raw.through(PdfStream.validate(Log.noop))\n```\n\n# Limitations\n\nLinearization is not possible at the moment, since the linearization parameter dict is the first object and needs\ninformation that is only available later, like the total file size.\n\nA heuristic method for keeping already linearized documents intact is in development.\n\n# Development\n\n## Testing\n\n```bash\nops/sbt test\n```\n\n## 
Publishing\n\nIf you work at SpringerNature and have access to this project's pipeline, you can trigger a deployment to Maven with\nthe script at `ops/trigger-publish.bash`, which will verify some conditions and start the publish pipeline job.\n\n# License\n\nCopyright 2020 SpringerNature\n\n**fs2-pdf** is licensed under the Apache License 2.0\n\n[fs2]: https://fs2.io\n[scodec]: https://scodec.org\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspringernature%2Ffs2-pdf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fspringernature%2Ffs2-pdf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspringernature%2Ffs2-pdf/lists"}