{"id":13584073,"url":"https://github.com/mirage/conan","last_synced_at":"2025-04-29T01:31:10.639Z","repository":{"id":42124490,"uuid":"248800194","full_name":"mirage/conan","owner":"mirage","description":"Like detective conan, find clue about the type of the file","archived":false,"fork":false,"pushed_at":"2025-04-14T08:42:46.000Z","size":1944,"stargazers_count":52,"open_issues_count":3,"forks_count":7,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-20T21:41:50.710Z","etag":null,"topics":["file","mime"],"latest_commit_sha":null,"homepage":null,"language":"OCaml","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mirage.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGES.md","contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-03-20T16:18:25.000Z","updated_at":"2025-04-14T08:42:48.000Z","dependencies_parsed_at":"2024-11-06T00:33:08.706Z","dependency_job_id":"5711dd4b-5e05-4129-ae65-84babe0ff83d","html_url":"https://github.com/mirage/conan","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mirage%2Fconan","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mirage%2Fconan/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mirage%2Fconan/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mirage%2Fconan/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mirage","download_url":"https://codeload.github.com/mirage/conan/tar.gz/ref
s/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251415723,"owners_count":21585880,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["file","mime"],"created_at":"2024-08-01T15:03:59.631Z","updated_at":"2025-04-29T01:31:08.170Z","avatar_url":"https://github.com/mirage.png","language":"OCaml","readme":"# Conan, a detective which aggregates some clues to recognize MIME type\n\n`conan` is a re-implementation of the famous command `file`:\n```sh\n$ file --mime image.png\nimage/png\n$ conan.file --mime image.png\nimage/png\n```\n\nThis program/library (see `libmagic`) is widely used by protocols to transmit\nthe MIME type of a file. This then allows the right program to be called to\nmanipulate the given file.\n\nFor instance, the HTTP protocol transmits the MIME type of the body _via_ the\n`Content-Type` field:\n```http\nHTTP/1.1 200 OK\nContent-Type: image/png\nContent-Length: 4096\n\n\u003cyour image\u003e\n```\n\nYou can find `file` in use in many places such as your web browser (to\nbe able to execute the right application to interpret the given file) or your\nMail User Agent.\n\nHowever, `file` is pretty old (1987), its implementation is in C, it was not\nformalized and it does not have a standard. Taking `file` as an engine for\nfile recognition can be a risk (segmentation fault, undefined behavior,\nunstable release process, etc.).\n\nBut `file` has evolved over several years and it contains a great extensible\ndatabase which can be considered reliable due to its _seniority_. 
So, some famous\nsoftware projects decided to re-implement a subset of `file`/`libmagic` which is less\nexpressive/powerful but does the job within certain expectations.\n\n## The DSL - `libmagic`\n\nThe `file` database uses a language described by `man magic`:\n- a line describes an operation\n- an operation is:\n  + a test of a certain value at a certain position in the given file\n  + an anchor\n  + a _jump_ instruction to an anchor\n  + a MIME value\n  + a _strength_ value\n\nThese operations are organized as a _tree_. An operation is prefixed by a\n_level_ (`\u003e`) and, from it, we are able to construct the _decision tree_ which\ndescribes multiple paths to recognize the MIME type of the given file.\n\nFor instance:\n```magic\n[0]\n\u003e [1]\n\u003e\u003e [2]\n\u003e [3]\n\u003e\u003e [4]\n\u003e\u003e\u003e [5]\n```\n\nproduces this _decision tree_:\n```\n [0]\n  | \\\n [1][3]\n  |  |\n [2][4]\n     |\n    [5]\n```\n\n### The test operation\n\nAn operation is usually a test which compares the data starting at a particular\noffset in the file with a byte value, a string or a numeric value. If the test\nsucceeds, we continue along the path according to the _decision tree_. For\ninstance, if `[0]` succeeds, we will try `[1]` **and** `[3]`.\n\nAlong the way, we aggregate multiple solutions, each with a priority (see the\n_strength_ value), and we choose the one with the highest priority.\n\nA test operation consists of the following fields:\n- offset: A number specifying the offset (in bytes) into the file of the data\n  which is to be tested. This offset can be _relative_ to the previous\n  operation's offset if it begins with `\u0026`.\n- type: The type of the data to be tested. 
We implemented many types such as\n  `byte`, `short`, `long`, `string` or `date`.\n- test: the value to be compared with the value from the file.\n- message: the message to be printed if the comparison succeeds.\n\n### Let's play!\n\nAn example is more interesting than the _theory_. Let's try to recognize a\n`zlib` archive. According to [RFC1950][] (where CMF and FLG are the first two bytes\nof a `zlib` archive, in that order):\n\n\u003e The FCHECK value must be such that CMF and FLG, when viewed as\n\u003e a 16-bit unsigned integer stored in MSB order (CMF\\*256 + FLG),\n\u003e is a multiple of 31.\n\nThen, the first byte should have a CM = 8:\n\u003e CM (Compression method)\n\u003e    This identifies the compression method used in the file. CM = 8\n\u003e    denotes the \"deflate\" compression method with a window size up\n\u003e    to 32K.\n\nThe RFC also specifies that CM occupies the 4 least significant bits of the first\nbyte and that CINFO, its 4 most significant bits, must not be above 7. So the\nmost significant bit of the first byte (`CMF \u0026 0x80`) must be 0.\n\nFinally, we have 3 tests to do:\n1) the 16-bit number (big-endian order) must be a multiple of 31\n2) CM, which is the 4 least significant bits of the first byte, must be equal to 8\n3) the most significant bit of the first byte must be 0 (CINFO must not be\n   above 7)\n\nIn our syntax and according to the idea of a _decision tree_, we must **test**\nthese assertions step by step. 
At the end, we can say that the file is\nprobably an `application/zlib` file:\n```magic\n0\tbeshort%31\t=0\n\u003e0\tbyte\u00260xf\t=8\n\u003e\u003e0\tbyte\u00260x80\t=0\n!:mime\tapplication/zlib\n```\n\nNow, let's play with `conan`:\n```ocaml\nopen Rresult\n\n(* The magic description from above, embedded as a quoted string. *)\nlet zlib =\n  {file|0\tbeshort%31\t=0\n\u003e0\tbyte\u00260xf\t=8\n\u003e\u003e0\tbyte\u00260x80\t=0\n!:mime\tapplication/zlib\n|file}\n\n(* Parse the description into a decision tree; fail on a parse error. *)\nlet tree = R.failwith_error_msg @@ Conan_unix.tree_of_string zlib\nlet () =\n  if Array.length Sys.argv \u003e= 2\n  then\n    (* Run the tree against the file given on the command line. *)\n    let m = R.failwith_error_msg @@\n      Conan_unix.run_with_tree tree Sys.argv.(1) in\n    match Conan.Metadata.mime m with\n    | Some v -\u003e Fmt.pr \"%s\\n%!\" v\n    | None -\u003e Fmt.epr \"MIME type not found.\\n%!\"\n  else Fmt.epr \"%s \u003cfilename\u003e\\n%!\" Sys.argv.(0)\n```\n\nThis little program will only recognize \"application/zlib\" according to our\ndescription above. Of course, the DSL can be more complex than that!\n\n### Complex recognition\n\n#### Indirect offset\n\nOffsets do not need to be constant, but can also be read from the file being\nexamined. If the first character following the last `\u003e` is a parenthesis then\nthe string inside the parentheses is interpreted as an indirect offset: the\nvalue at that offset is read, and is used again as an offset in the file.\n\nFor instance, this _tree_ performs an indirection through the unsigned long number\n(little-endian) stored at offset `0x3c`:\n```magic\n0\t\tstring\tMZ\n\u003e0x18\t\tleshort \u003e0x3f\n\u003e\u003e(0x3c.l)\tstring\tPE\\0\\0\tPE executable (MS-Windows)\n\u003e\u003e(0x3c.l)\tstring\tLE\\0\\0\tLX executable (OS/2)\n```\n\nYou should check `man magic` for the syntax and the available types. 
You can\napply a calculation if the indirect offset cannot be used directly,\nas in this example where we multiply the indirect offset by 512:\n```magic\n\u003e0x18\t\tleshort\t\u003c0x40\n\u003e\u003e(4.s*512)\tleshort\t0x014c\tCOFF executable (MS-DOS, DJGPP)\n\u003e\u003e(4.s*512)\tleshort !0x014c\tMZ executable (MS-DOS)\n```\n\n#### Relative offset\n\nMoreover you can specify an offset relative to the end of the last up-level\nfield using `\u0026` as a prefix to the offset:\n```magic\n0\t\tstring\t\tMZ\n\u003e0x18\t\tleshort\t\t\u003e0x3f\n\u003e\u003e(0x3c.l)\tstring\tPE\\0\\0\tPE executable (MS-Windows)\n\u003e\u003e\u003e\u00260\t\tleshort 0x14c\tfor Intel 80386\n\u003e\u003e\u003e\u00260\t\tleshort 0x184\tfor DEC Alpha\n```\n\nAnd, of course, indirect and relative offsets can be combined.\n\n#### Jump and recursion\n\nIt is possible to define a \"named\" magic instance that can be called from\nanother `use` magic entry, like a subroutine call. The offset of the subroutine\nis relative to the caller.\n\nTo call a subroutine, we use the `use` operation with the name of\nthe subroutine. You don't need to define the subroutine before the caller.\nIndeed, `file` and `conan` collect all subroutines first and then process the\n_decision tree_.\n\nThis is a simple example to determine whether the length of the given file is odd or\neven:\n```magic\n0\tname\t\teven\n\u003e0\tbyte\t\tx\teven\n\u003e\u003e1\tuse\t\todd\n\n0\tname\t\todd\n\u003e0\tbyte\t\tx\todd\n\u003e\u003e1\tuse\t\teven\n\n0\tbyte\t\tx\n\u003e0\tuse \t\todd\n```\n\n#### Other operations\n\nThe `libmagic` DSL implements many things but, as we said, no standard exists\nfor it. We mostly reverse-engineered it to implement\noperations. Some of them are not implemented, either due to the lack of definitions\nor just because we did not find them in the `file` database. 
Some others are\nexplicitly not implemented because we consider them a hack rather than a\n_homogeneous_ feature.\n\nThen, we mostly focus on delivering the MIME type instead of a full\ndescription of the given file. `file` shows you many things such as the size\nof the image, the bitrate of the sound, etc. We tried to implement them but\nwe are more focused on MIME recognition.\n\n### Experimental\n\nAccording to what we said above, `conan` is experimental and, from a usage\npoint of view, it can leak exceptions such as `Unimplemented feature`.\n\nThen, even if a lot of work was done on types, where we try to unify the type of the\nexpected value with the type of the test, the type expected by the message is still\nweak (for many reasons). In other words, even if we can parse and process the\n_decision tree_, we can still fail when we print out messages (because\nwe cannot unify the type of the value with the type expected by the given\nmessage).\n\nFinally, there is no standard for the `file` database and the `man`\npages are a bit out of date with respect to what the `file` command does. For these\nreasons, it's hard to prove, or even to say, that we have the same behavior as `file`.\nWe try to be close to what it does, but in some edge cases, we cannot ensure\nthat we will produce the same result as `file`.\n\nAlso, we have not explored everything in the given database. Even if we can\nparse and generate a decision tree from the database, some specific execution\npaths can lead to an unexpected failure. We will of course fix them step by\nstep. Feel free to test and open an issue!\n\n## MirageOS support\n\nThe other goal of `conan` is to be able to integrate the database into a\nunikernel and to give an application (such as a web server) the opportunity\nto recognize MIME types of files.\n\n### _syscalls_\n\nLike any MirageOS project, `conan` abstracts the _syscalls_ required to introspect\na file. 
In this way, `conan.string` exists and it is able to recognize the MIME\ntype of a given `string` (instead of a file). `lwt` support exists too, which\nmanipulates a _stream_.\n\n## Database\n\n`conan` is able to parse a database and serialize it as a full OCaml value. The\ndistribution provides 2 databases:\n- the `file` database\n- the previous database without the extra paths which do not tag the MIME type\n\nThe second is lighter than the first and should be used only to get the MIME\ntype. Indeed, any information such as the size of the image or the bitrate of\nthe sound is removed.\n\nFor instance, a unikernel for Solo5 with the lighter database is around 6 MB.\n\nYou can also build your own special database. If you know that you want to\nrecognize only a few objects, you can merge `tree` values for these objects and\nmake a smaller database:\n\n```ocaml\n#require \"conan-unix\" ;;\n#require \"conan-database\" ;;\n\nlet tree0 = Conan_compress.tree\nlet tree1 = Conan_ocaml.tree\nlet tree2 = Conan_audio.tree\n\nlet tree = List.fold_left Conan.Tree.merge Conan.Tree.empty\n  [ tree0; tree1; tree2 ]\n\nlet recognize_ocaml_or_archive_or_audio filename =\n  Conan_unix.run_with_tree tree filename\n```\n","funding_links":[],"categories":["OCaml","\u003ca name=\"file-handling\"\u003e\u003c/a\u003eFile and file system handling","Other"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmirage%2Fconan","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmirage%2Fconan","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmirage%2Fconan/lists"}