More

alexrustic · on Dec 2, 2024

Jinbase is designed to be thread-safe, ensuring it can be reliably used in a multithreaded context.

To interact with SQLite, Jinbase uses LiteDBC [1] an SQL interface compliant with the DB-API 2.0 specification described by PEP 249 [2], itself wrapping Python's sqlite3 [3] module for a more intuitive interface and multithreading support by default. I wrote for LiteDBC a stress test [4] involving concurrency with Asyncpal [5].

[1] https://github.com/pyrustic/litedbc

[2] https://peps.python.org/pep-0249/

[3] https://docs.python.org/3/library/sqlite3.html

[4] https://github.com/pyrustic/litedbc/blob/master/tests/test_s...

[5] https://news.ycombinator.com/item?id=41404020

alexrustic · on Dec 1, 2024

Thank you for your comment ! Indeed, for wide adoption across languages, we will need to port at least Paradict as that is the format in which BLOBs are serialized.

Protobuf relies heavily on predefined schemas and this rigidity goes against the flexibility of Jinbase's schema-less philosophy.

MessagePack (or CBOR) seems more convincing but Paradict has some subtleties that I don't find there. For example, Paradict preserves UTC offsets [1], handles integer bases, allowing for the representation of integers in decimal, binary, octal, and hexadecimal formats, has an extension mechanism that I find more interesting, etc.

Soon I will be adding a command line interface to Jinbase. From the CLI one will be able to read and write any type of data and this will only be possible because Paradict has a twin text format. MessagePack, from what I know, started with 1:1 compatibility with JSON and then over time it added things that are not present in JSON, thus breaking the 1:1 compatibility with JSON.

If I understand Peter Naur's take on programming [2] correctly, Jinbase is a software idea that I'm trying to implement (bring to life) one iteration at a time, and for that I need to have some level of control over the components (like the serialization format) so that I can adjust things accordingly. For example, the Paradict binary format is originally intended to serialize and deserialize only dictionaries (P...dict), but I changed that detail so that Jinbase users can freely store other things than dictionaries.

Once the core idea is fully implemented, we will see how to reproduce it elsewhere, one contribution/compilation after another...

[1] https://codeblog.jonskeet.uk/2019/03/27/storing-utc-is-not-a...

[2] https://news.ycombinator.com/item?id=26027448

alexrustic · on Sept 2, 2024

If I understand correctly, goroutines = asyncio (non-invasive) + asyncio.to_thread (when needed, but automatic). So, it is still cooperative concurrency that feels preemptive due to its design. This Go capability is interesting, and it 'seems' that it cannot be replicated without some integration with the runtime of the target language, i.e., a 'baked-in solution'. For now, I'm fine with true preemptive concurrency, but this might change in the future.

alexrustic · on Aug 31, 2024

Thank you !

alexrustic · on Aug 31, 2024

Thanks for your comment !

When you are doing parallelism, don't forget to protect the 'entry point' of the program by using "if __name__ == '__main__'", and also avoid the __main__.py file [1]

  # this file isn't `__main__.py` !
  from asyncpal import ProcessPool

  def square(x):
      return x**2

  if __name__ == "__main__":  # very important !
      with ProcessPool(4) as pool:
          numbers = range(1000)
          # note that 'map_all' isn't lazy
          iterator = pool.map(square, numbers)  # map is lazy
          result = tuple(iterator)
          assert result == tuple(map(square, numbers))

[1] https://discuss.python.org/t/why-does-multiprocessing-not-wo...

alexrustic · on Aug 31, 2024

I was asked hours ago by a concurrency enthusiast, whose website has been a valuable source of information for me on the topic, to tell in a sentence what capability Asyncpal provides to users above the Python standard library (stdlib).

I found the question interesting because it goes straight to the point and calls for a concise answer. I believe the answer is missing from this 'Show HN'. Here is my response to the question:

Asyncpal unifies the stdlib (concurrent.futures + multiprocessing.pool) and provides true elastic pools (grow + shrink) with an intuitive interface.

alexrustic · on Aug 31, 2024

If I'm given a chance to develop, I would start with the last part of the sentence that might sound subjective. For example, the 'Future' class in 'concurrent.futures' exposes a method named 'result' to collect the result of a task. In contrast, the 'Future' class in Asyncpal exposes a 'collect' method and a 'result' property.

The stdlib's pools only grow in size and do not shrink, making them non-elastic. Therefore, I would not want to keep them alive in the background for sporadic workloads [1].

Discussions on 'concurrent.futures' vs 'multiprocessing.pool.Pool' highlight that each has unique features. While 'concurrent.futures' is the modern package, it omits some niceties found in 'multiprocessing.pool.Pool'. For example, 'concurrent.futures' has only one 'map' [2] method, which works eagerly and therefore not suitable for very long iterables [3][4]. However, I acknowledge the superiority of 'concurrent.futures.Future' [5] over 'multiprocessing.pool.AsyncResult' [6] because tasks cannot be cancelled with the latter (among other things).

[1] https://www.cloudcomputingpatterns.org/unpredictable_workloa... (related)

[2] https://docs.python.org/3/library/concurrent.futures.html#co...

[3] https://docs.python.org/3/library/multiprocessing.html#multi...

[4] https://docs.python.org/3/library/multiprocessing.html#multi...

[5] https://docs.python.org/3/library/concurrent.futures.html#fu...

[6] https://docs.python.org/3/library/multiprocessing.html#multi...

alexrustic · on March 17, 2024

Here is a ChatML document [1][2][3]:

  <|im_start|>system
  You are ChatGPT, a large language model trained by OpenAI. Answer as concisely as possible.<|im_end|>
  <|im_start|>user
  Hello world!<|im_end|>
  <|im_start|>assistant
  Hello there!<|im_end|>
  <|im_start|>system
  Now, you are John Wick. Speak like him.<|im_end|>
  <|im_start|>user
  Hello world!<|im_end|>
  assistant

As you can see, this is an XML-like format where user input must be sanitized to avoid prompt injection attacks.

Here's a Braq document [4] that uses indentation instead of XML-like tags:

  You are an AI assistant, your name is Jarvis.

  You will access the websites defined in the WEB section
  to answer the question that will be submitted to you.
  The question is stored in the 'input' key of the USER 
  dict section.

  Be kind and consider the conversation history stored
  in the 'data' key of the HISTORY dict section.

  [USER]
  timestamp = 2024-12-25T16:20:59Z
  input = (raw)
      Today, I want you to teach me prompt engineering.
      Please be concise.
      ---

  [WEB]
  https://github.com
  https://www.xanadu.net
  https://www.wikipedia.org
  https://news.ycombinator.com

  [HISTORY]
  0 = (dict)
      timestamp = 2024-12-20T13:10:51Z
      input = (raw)
          What is the name of the planet
          closest to the sun ?
          ---
      output = (raw)
          Mercury is the planet closest
          to the sun !
          ---
  1 = (dict)
      timestamp = 2024-12-22T14:15:54Z
      input = (raw)
          What is the largest planet in
          the solar system?
          ---
      output = (raw)
          Jupiter is the largest planet
          in the solar system !
          ---

User input does not need to be sanitized if it is programmatically inserted into the document as the value of a key in a regular dict section.

To work, I assume the target model needs to be trained on Braq documents with emphasis on the fact that only the top unnamed section contains root instructions (equivalent to the "system" role in ChatML).

[1] https://news.ycombinator.com/item?id=34988748

[2] https://community.openai.com/t/chatml-documentation-update/5...

[3] https://www.reddit.com/r/LocalLLaMA/comments/17u7k2d/once_an...

[4] https://github.com/pyrustic/braq?tab=readme-ov-file#ai-promp...

alexrustic · on March 17, 2024

Thank you for your comment !

At the end of your answer there is "[Format you would like the result in]". Well, I'm curious what format you want the input (the sequence you presented) to be in.

I will also be happy if you can use the 2-space indentation formatting of HN (code block) to show a practical example.

alexrustic · on March 17, 2024

Ok, let's assume hallucinations are here to stay. What do you think is the ideal format to structure AI prompts ?

BOOSTERHIDROGEN · on March 17, 2024

As the context expands, you can pour all of the sources into it, for example, https://old.reddit.com/r/ChatGPTCoding/comments/1bghp8p/i_ma...

alexrustic · on March 17, 2024

This is a tool that automates the copying and pasting of multiple source files into a Markdown document (the prompt) in order to contain an entire code base in a single prompt.

By prompt structuring format, I mean something higher level (format, language) like OpenAI's ChatML: https://news.ycombinator.com/item?id=34988748

A document generated with the project you showed me would just be "user input" inserted into a ChatML document, just below the actual OpenAI instructions defined in a system node. Here, the LLM would consume the ChatML document inside which is inserted the Markdown (containing an entire code base) generated by the tool you showed me.

alexrustic · on March 15, 2024

The backspace escape character (https://stackoverflow.com/questions/6792812/the-backspace-es...) might be a good candidate for successfully creating a valid section in a document.

In a ChatML document, this character can also help destroy the closing tag of an instruction node.

But this can only work if the escape character is actually 'executed'.