{"id":27969030,"url":"https://github.com/prismadic/tractor-beam","last_synced_at":"2025-05-07T21:08:14.750Z","repository":{"id":212722819,"uuid":"732165980","full_name":"Prismadic/tractor-beam","owner":"Prismadic","description":"high-efficiency text \u0026 file scraper with smart tracking, client/server networking for building language model datasets fast","archived":false,"fork":false,"pushed_at":"2025-01-25T17:28:11.000Z","size":10192,"stargazers_count":7,"open_issues_count":4,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-05-07T21:08:06.387Z","etag":null,"topics":["botnet","cluster","data","file-downloader","llm","llm-finetuning","llm-training","mass-downloader","scraping"],"latest_commit_sha":null,"homepage":"https://prismadic.github.io/tractor-beam/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Prismadic.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-12-15T20:13:35.000Z","updated_at":"2025-01-25T17:27:16.000Z","dependencies_parsed_at":"2023-12-15T22:02:53.408Z","dependency_job_id":"866aeaad-b4eb-495f-9f84-6fd3f18a4178","html_url":"https://github.com/Prismadic/tractor-beam","commit_stats":null,"previous_names":["prismadic/tractor-beam"],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Prismadic%2Ftractor-beam","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Prismadic%2Ftractor-beam/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/Prismadic%2Ftractor-beam/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Prismadic%2Ftractor-beam/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Prismadic","download_url":"https://codeload.github.com/Prismadic/tractor-beam/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252954410,"owners_count":21830905,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["botnet","cluster","data","file-downloader","llm","llm-finetuning","llm-training","mass-downloader","scraping"],"created_at":"2025-05-07T21:08:14.118Z","updated_at":"2025-05-07T21:08:14.735Z","avatar_url":"https://github.com/Prismadic.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ccenter\u003e\n\u003cp align=\"center\"\u003e\n   \u003cimg height=\"250\" width=\"250\" src=\"./tractor_beam.png\"\u003e\n   \u003cbr\u003e\n   \u003ch3 align=\"center\"\u003etractor-beam\u003c/h3\u003e\n   \u003cp align=\"center\"\u003ehigh-efficiency text \u0026 file scraper with smart tracking\u003c/p\u003e\n   \u003cp align=\"center\"\u003e\u003ci\u003e~ client/server networking for building language model datasets \u003cb\u003efast\u003c/b\u003e ~\u003c/i\u003e\u003c/p\u003e\n\u003c/p\u003e\n\n\u003c/center\u003e\n\n## 💾 Installation\n\n``` bash\npip install llm-tractor-beam\n```\n\nor\n\n``` bash\npython3 setup.py install\n```\n\n## 🛸 Tutorial\n\n[examples](https://github.com/Prismadic/tractor-beam/blob/main/examples/examples.ipynb)\n\n## 🌈 
`tractor.Beam()`\n\nThe `Beam` class serves as the core engine of a highly configurable, modular library designed for parallel processing and automation of tasks such as web scraping, data downloading, processing, and storage. This class leverages various components and lower-level functions to orchestrate complex workflows. Here's an in-depth look at its roles and interactions with other components:\n\n#### ⚙️ Initialization and Configuration\n\n\u003e [!NOTE]  \n\u003e Upon initialization, the `Beam` class loads and verifies the configuration using the `Config` class. It checks if the configuration adheres to the expected structure and format, indicating the system's readiness to execute tasks as defined by the user.\n\n#### Job Processing and Workflow Management\n- **Job Processing**: The `process_job` and `_runner` methods are central to executing tasks defined in the configuration. These methods handle the execution flow of each job, including data downloading (`Abduct` class), data recording (`Visits` class), and data processing (`Focus` class). This showcases the class's ability to manage diverse tasks sequentially, ensuring each step is completed before moving to the next.\n- **Parallel and Delayed Execution**: The `go` method orchestrates the execution of all jobs, allowing for parallel processing to optimize resource utilization. It uses Python's `multiprocessing` to distribute tasks across available CPU cores, enhancing efficiency, especially for CPU-bound tasks. Additionally, it supports delayed execution for specific jobs, enabling time-controlled or periodic task execution.\n- **Resource Management**: By leveraging the `Pool` class from `multiprocessing` for parallel execution, the `Beam` class efficiently manages system resources. 
It calculates the optimal number of processes based on the number of available CPU cores and the number of jobs, ensuring a balance between performance and resource usage.\n\n## 📝 `utils.Config()`\n\nThe `Config` class is responsible for loading, parsing, saving, and manipulating configuration data. It can load configuration from a file or a dictionary, parse the configuration data into a structured format, save the configuration to a file, unbox the configuration by creating a project directory, create a new project directory with a configuration file, and destroy a project directory.\n\n#### Example Usage\n```python\n# Load configuration from a file\nconfig = Config('config.json')\nconfig.load_conf('config.json')\n\n# Load configuration from a dictionary\nconfig_dict = {\n    \"role\": \"watcher\",\n    \"settings\": {\n        \"name\": \"my_project\",\n        \"proj_dir\": \"/path/to/project\",\n        \"jobs\": [\n            {\n                \"url\": \"https://example.com\",\n                \"types\": [\"type1\", \"type2\"],\n                \"beacon\": \"beacon1\",\n                \"delay\": 1.5,\n                \"custom\": {\n                    \"func\": \"my_function\",\n                    \"headers\": {\"header1\": \"value1\"},\n                    \"types\": [\"type3\", \"type4\"]\n                }\n            }\n        ]\n    }\n}\nconfig.load_conf(config_dict)\n\n# Save the configuration to a file\nconfig.save()\n\n# Unbox the configuration by creating a project directory\nconfig.unbox()\n\n# Create a new project directory with a configuration file\nconfig.create()\n\n# Destroy a project directory\nconfig.destroy(confirm=\"my_project\")\n```\n\n#### Code Analysis\n##### Main functionalities\n- Load configuration from a file or a dictionary\n- Parse the configuration data into a structured format\n- Save the configuration to a file\n- Unbox the configuration by creating a project directory\n- Create a new project directory with a 
configuration file\n- Destroy a project directory\n___\n##### Methods\n- `__init__(self, conf: Union[str, dict, None] = None)`: Initializes a new instance of the `Config` class and loads the configuration.\n- `load_conf(self, conf)`: Loads the configuration from a file or a dictionary.\n- `parse_conf(self, conf_dict: Dict[str, Any]) -\u003e Schema`: Parses the configuration data into a structured format.\n- `save(self)`: Saves the configuration to a file.\n- `unbox(self, overwrite: bool = False)`: Unboxes the configuration by creating a project directory.\n- `create(self, config: dict = None)`: Creates a new project directory with a configuration file.\n- `destroy(self, confirm: str = None)`: Destroys a project directory.\n___\n##### Fields\n- `conf`: The parsed configuration data.\n- `conf.settings`: The settings of the configuration.\n- `conf.settings.name`: The project name.\n- `conf.settings.proj_dir`: The project directory of the configuration.\n- `conf.settings.jobs`: The list of jobs in the configuration.\n- `conf.settings.jobs.url`: The URL of a job.\n- `conf.settings.jobs.types`: The types of a job.\n- `conf.settings.jobs.beacon`: The beacon of a job.\n- `conf.settings.jobs.delay`: The delay of a job.\n- `conf.settings.jobs.custom`: The custom job data of a job.\n- `conf.settings.jobs.custom.func`: The function of a custom job.\n- `conf.settings.jobs.custom.headers`: The headers of a custom job.\n- `conf.settings.jobs.custom.types`: The types of a custom job.\n___\n\n## 🧮 `utils.BeamState()`\nThe `BeamState` class tracks the state of a `Beam` run. 
It includes information about the host system, as well as the states of different components such as abduction, focus, and visit.\n\n#### Example Usage\n```python\n# Create an instance of BeamState\nbeam = BeamState()\n\n# Update the abduction state\nabduct_state = AbductState(conf={\"param\": \"value\"})\nbeam.abduct_state_update(abduct_state)\n\n# Update the focus state\nfocus_state = FocusState(conf={\"param\": \"value\"})\nbeam.focus_state_update(focus_state)\n\n# Update the visit state\nrecord_state = RecordState(conf={\"param\": \"value\"})\nbeam.record_state_update(record_state)\n\n# Update the host state\nbeam.host_state_update()\n\n# Access the current state of the beam\ncurrent_state = beam.states\n```\n\n#### Code Analysis\n##### Main functionalities\n- Get information about the host system, including platform, CPU usage, memory usage, disk usage, network I/O, etc.\n- Update and retrieve the states of different components such as abduction, focus, and visit.\n- Keep track of the history of host states.\n___\n##### Methods\n- `__init__()`: Initializes the `BeamState` class by setting the initial host info and states.\n- `get_host_info()`: Retrieves the current host information and returns a `HostInfo` object.\n- `abduct_state_update(state)`: Updates the abduction state by appending a new `AbductState` object to the `abduct` list in `states`.\n- `focus_state_update(state)`: Updates the focus state by appending a new `FocusState` object to the `focus` list in `states`.\n- `record_state_update(state)`: Updates the visit state by appending a new `RecordState` object to the `visit` list in `states`.\n- `host_state_update()`: Updates the host state by appending a new `HostInfo` object to the `host_info` list.\n___\n##### Fields\n- `host_info`: A list of `HostInfo` objects that represent the history of host states.\n- `states`: An instance of the `States` class that contains the states of different components such as abduction, focus, and visit.\n___\n\n\n## 📝 
`abduct.Abduct()`\nThe `Abduct` class is responsible for downloading files from a given URL or a list of URLs. It can handle both simple URLs and URLs with recursion. It also supports the option to overwrite existing files.\n\n#### Example Usage\n```python\n# Initialize the Abduct class\nabduct = Abduct(conf=conf, job=job)\n\n# Download files from a single URL\nabduct.download()\n\n# Download files from a single URL and overwrite existing files\nabduct.download(o=True)\n\n# Download files from a single URL and specify a custom file name\nabduct.download(f=\"custom_file_name\")\n\n# Download files from a URL with recursion\nabduct.download(types=[\"pdf\", \"docx\"])\n\n# Download files from a URL with recursion and overwrite existing files\nabduct.download(types=[\"pdf\", \"docx\"], o=True)\n```\n\n#### Code Analysis\n##### Main functionalities\n- Initialize the `Abduct` class with a configuration and a job object.\n- Download files from a single URL or a list of URLs.\n- Handle URLs with recursion and filter files by their types.\n- Overwrite existing files if specified.\n___\n##### Methods\n- `__init__(self, conf: dict = None, job: Job = None)`: Initializes the `Abduct` class with a configuration and a job object. It prints an info message if the configuration is loaded successfully.\n- `_fetch_to_write(self, attachment, headers, attachment_path, file_name, block_size, o=False)`: Downloads a file from a given URL and writes it to the specified path. It appends the file information to the `state.data` list.\n- `download(self, o: bool=False, f: str=None)`: Downloads files from a URL or a list of URLs. It handles both simple URLs and URLs with recursion. It can overwrite existing files if specified. 
It returns the `state` object.\n___\n##### Fields\n- `state`: An instance of the `AbductState` class that stores the current state of the `Abduct` class.\n- `state.conf`: A dictionary that represents the configuration.\n- `state.job`: An instance of the `Job` class that represents the current job.\n- `state.data`: A list of dictionaries that stores the information of downloaded files. Each dictionary contains the file name and its path.\n___\n\n## 📡 `abduct.beacons.*`\n\nBeacons are the pluggable modules that let the system scrape, download, and process data from different sources. Each beacon is a module built around a `Stream` class, which keeps the system flexible and modular. Their structure and role break down as follows:\n\n##### Role of Beacons\n\n#### Modularity:\nBeacons act as interchangeable modules within the system. Each beacon corresponds to a specific source or type of data (e.g., financial filings, news articles) and encapsulates the logic necessary for fetching, parsing, and processing data from that source. This modularity allows users to extend the system's capabilities by adding new beacons for different sources without altering the core functionality.\n#### Customizability:\nBeacons are designed to be customizable, allowing users to specify parameters and behaviors specific to the data source they target. This is evident in the `Stream` class, where the `fetch` method can be tailored to parse and retrieve data according to the unique structure of each source.\n\n\u003e [!TIP]  \n\u003e The `Helpers` class within a beacon provides further bespoke processing and manipulation of the fetched data.\n\n#### Uniform Interface:\nDespite their differences in implementation, all beacons share a common interface, exemplified by the mandatory inclusion of a `Stream` class with consistent functions. 
This uniformity lets the main system interact with any beacon in a predictable way, which simplifies integration.\n#### Enhanced Functionality through Helpers:\nA `Stream` class is mandatory for basic operations. A beacon may also include a `Helpers` class with utility functions specific to that beacon's data or operations, which allows data manipulation and processing tailored to the beacon's use case.\n#### Integration with the Main System:\nThe main system loads beacons dynamically with `importlib` and passes the configuration and job details to them in a structured way. This lets the system use the capabilities of each beacon while keeping a single, cohesive workflow.\n\n##### Conclusion\n\nBeacons are specialized modules that can be loaded dynamically to add or modify the system's data processing capabilities. Because every beacon follows the same interface while allowing beacon-specific customization, the system can cover a wide range of data sources and processing requirements and remains easy to maintain and extend.\n\n## 🔍 `laser.Focus()`\nThe `Focus` class processes files by reading their contents, detecting the encoding, and performing specific actions based on the file type. It uses the `Strip` class to sanitize and extract text content from XML or HTML documents. 
The processed data is then written to a file using the `writeme` function.\n\n#### Example Usage\n```python\n# Initialize a Focus object with a configuration and job\nfocus = Focus(conf=conf, job=job)\n\n# Process a list of files\ndata = [{'path': 'file1.xml'}, {'path': 'file2.html'}]\nresult = focus.process(data)\n\n# Destroy a file\nfocus.destroy(confirm='file1.xml')\n```\n\n#### Code Analysis\n##### Main functionalities\n- Initialize a `Focus` object with a configuration and job\n- Process files by reading their contents, detecting the encoding, and extracting text content\n- Write the processed data to a file\n- Destroy a file if the confirmation matches the file name\n___\n##### Methods\n- `__init__(self, conf: dict = None, job: Job = None)`: Initializes a `Focus` object with a configuration and job. Prints an initialization message.\n- `process(self, data: dict = None)`: Processes a list of files by reading their contents, detecting the encoding, and extracting text content. Writes the processed data to a file. Returns the updated state of the `Focus` object.\n- `destroy(self, confirm: str = None)`: Removes a file if the confirmation matches the file name. Prints a message indicating whether the file was successfully destroyed or not.\n___\n##### Fields\n- `state`: An instance of the `FocusState` class that stores the configuration and job information.\n- `state.conf`: A dictionary representing the configuration.\n- `state.job`: An instance of the `Job` class representing the job information.\n- `state.data`: A list of dictionaries representing the processed data. Each dictionary contains the path of the file and the path of the cleaned file.\n___\n\n## 📝 `visit.Visit()`\nThe `Visit` class is responsible for creating and managing records in a CSV file. 
It has methods for initializing the class, creating a new CSV file, seeking specific records, and writing records to the CSV file.\n\n#### Example Usage\n```python\n# Initialize the Visit class\nvisit = Visit(conf=conf, job=job)\n\n# Create a new CSV file\nvisit.create(data=data)\n\n# Seek specific records\nvisit.seek(line=2)\n\n# Write records to the CSV file\nvisit.write()\n```\n\n#### Code Analysis\n##### Main functionalities\nThe main functionalities of the `Visit` class are:\n- Initializing the class with a configuration and job object\n- Creating a new CSV file with headers and data\n- Seeking specific records in the CSV file\n- Writing records to the CSV file\n___\n##### Methods\nThe `Visit` class has the following methods:\n- `__init__(self, conf: dict = None, job: Job = None)`: Initializes the class with a configuration and job object.\n- `create(self, data: dict = None, o: bool = False)`: Creates a new CSV file with headers and data.\n- `seek(self, line: str | int = None, all: bool = False)`: Seeks specific records in the CSV file.\n- `write(self, o: bool = False, ts: bool = True, v: bool = False)`: Writes records to the CSV file.\n___\n##### Fields\nThe `Visit` class has the following fields:\n- `headers`: A list to store the headers of the CSV file.\n- `state`: An instance of the `RecordState` class that stores the configuration, job, and data of the visit.\n___","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprismadic%2Ftractor-beam","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fprismadic%2Ftractor-beam","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprismadic%2Ftractor-beam/lists"}