1 -  CLIB


    I told you about the clib I released for my OS... I have now made a version for Linux that can be used with C-style calls (args on the stack) or ASM-style calls (args in registers)...
I have modified it, adding a lot of conditional assembly... I have defined two flags for the call style,
at the beginning of the source:
    C_CALL is for C-style passing of args on the stack
    ASM_CALL is for ASM-style calls through registers

the second pair of flags is:
        SPEEDOPT  for speed optimization
        SIZEOPT for, guess what??? yes, size optimization

for the moment, you have to choose either SIZEOPT or SPEEDOPT, not both at the same time..
same for C_CALL and ASM_CALL
    to set a flag you just have to uncomment the related line at the beginning of the source

    OK, here are the functions implemented in this first version...:

    itoa
    strlen
    memset
    memcpy
    memcpyl
    fprintf
    printf
    malloc
    calloc
    free

    first:   the memory allocation trio (malloc, free, calloc) & MemInit

    the memory allocation trio (malloc, free, calloc) was developed for my OS first. I have tested them extensively, and I am proud of them, because they are pretty fast & short... I think they are pretty safe too, I have found no bugs...
    The memory allocation scheme uses a sort of linked list... each memory block is described by a 12-byte record
structured like this...:

struc MAB
    .start resd 1                ; beginning of block (byte offset divided by 32; 0 = first location of the mem space)
    .size resd 1                 ; size of the mem block (divided by 32 too)
    .pid resd 1                  ; pid of the process which allocated the block
endstruc

    the allocated memory blocks are always multiples of 32 bytes. I could have allowed 1-byte granularity, but I think 32 bytes is good like that: as long as the mem space itself starts on a 32-byte boundary, every block is aligned on a 32-byte boundary (a cache line on many x86 CPUs), and it limits memory fragmentation...
    when you pass a memory size to malloc, it converts it to a multiple of 32 bytes, so from the caller's side it behaves just like the standard C malloc; same for free & calloc...
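
    To make the 32-byte bookkeeping concrete, here is a small C sketch of the descriptor and of the size conversion. It is only an illustration, not the clib source: the names struct mab, MAB_UNIT, bytes_to_units and units_to_bytes are mine, and I am assuming that sizes are rounded up to the next multiple of 32.

```c
#include <assert.h>
#include <stdint.h>

#define MAB_UNIT 32u   /* allocation granularity given in the text */

/* C mirror of the NASM MAB record: three dwords, 12 bytes in total. */
struct mab {
    uint32_t start;    /* block start, in 32-byte units from the base of the mem space */
    uint32_t size;     /* block length, in 32-byte units */
    uint32_t pid;      /* process that allocated the block */
};
static_assert(sizeof(struct mab) == 12, "descriptor must stay 12 bytes");

/* Convert a byte count to whole 32-byte units.
 * Rounding UP is an assumption; the text only says "a 32 byte multiple". */
uint32_t bytes_to_units(uint32_t bytes)
{
    return bytes / MAB_UNIT + (bytes % MAB_UNIT != 0);
}

/* Convert a unit count back to bytes. */
uint32_t units_to_bytes(uint32_t units)
{
    return units * MAB_UNIT;
}
```

    With this rounding, a request for 100 bytes occupies 4 units, that is 128 bytes.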

      I have included the pid of the process that allocated each memory block, so that I can later implement a sort of killmem(pid) function, which will free all the blocks allocated by a process, based on its pid...

    I have done some timing on numerous allocations, and I can say that it is pretty fast...
    Note that in this first version there is no multithreading lock to prevent different processes from allocating blocks at the same time...
    But I will add it very soon...

    OK... so to use these memory allocation routines...
    You have to call MemInit first... the C syntax is MemInit(memaddr, memsize), look in the source for the ASM syntax...
This function passes to the memory allocation core the address (memaddr) & the size (memsize) of a physical memory block, which will be used as the available memory for malloc, free & calloc...
    For example, if you want to use my clib with your C program, you can allocate a 5 MByte block of memory with the standard malloc, then call MemInit with the address of the block returned by the C malloc, and its size....
    That's all, you can now work only with my versions of malloc, free & calloc, like you do with the system versions...
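
    A minimal C sketch of that setup sequence. The real MemInit, malloc and free come from the clib and are not reproduced here: MemInit_standin and malloc_standin are hypothetical stand-ins (a trivial bump allocator that never reuses memory) so that the example compiles on its own, and setup_pool is a helper name I made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* --- hypothetical stand-ins, NOT the clib implementation --- */
static unsigned char *pool_base;
static size_t pool_size;
static size_t pool_used;

/* Same argument order as in the text: MemInit(memaddr, memsize). */
void MemInit_standin(void *memaddr, size_t memsize)
{
    assert(memaddr != NULL);
    pool_base = memaddr;
    pool_size = memsize;
    pool_used = 0;
}

/* Hands out 32-byte multiples from the region given to MemInit_standin. */
void *malloc_standin(size_t size)
{
    size_t rounded = (size + 31u) & ~(size_t)31u;
    if (rounded < size || rounded > pool_size - pool_used)
        return NULL;                     /* overflow or pool exhausted */
    void *block = pool_base + pool_used;
    pool_used += rounded;
    return block;
}

/* --- the sequence described in the text --- */
void *setup_pool(size_t bytes)
{
    void *region = malloc(bytes);        /* system malloc provides the region */
    if (region != NULL)
        MemInit_standin(region, bytes);  /* hand it to the allocator core */
    return region;                       /* caller gives it back with the system free */
}
```

    After setup_pool(5 * 1024 * 1024), every allocation is served from inside that 5 MByte region.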

itoa: convert int to ascii

here is the core of itoa; as you can see it's short. This routine converts a 32-bit value given in eax to an ASCII string stored at [edi], and it can convert it in binary, octal, decimal or hexadecimal mode:
you just give the wanted base in ecx (2, 8, 10, 16 for example)
I was using this algorithm to convert to decimal strings, and I had never noticed before that just by adding the
cmp dl,'9' test to handle hexadecimal digits too, you get a universal int to ASCII converter.
I use it for my fprintf, to save space...

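itoa:
        sub edx,edx             ; clear edx: div divides edx:eax
        div ecx                 ; eax = value / base, edx = value % base
        test eax,eax
        jz short .print0        ; no higher digits left: print this one
        push edx                ; save the current digit
        call itoa               ; print the higher digits first (recursion)
        pop edx
.print0:
        add dl,'0'              ; digit -> '0'..'9'
        cmp dl,'9'
        jle short .print1
        add dl,0x27             ; digits above 9 -> 'a'..'f'
.print1:
        mov [edi],dl            ; store the character
        inc edi                 ; edi ends up just past the last digit
        ret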
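
For readers more at home in C, here is a sketch of the same recursion. It is my transcription, not part of the clib: the name itoa_sketch and the returned pointer (standing in for the final value of edi) are assumptions. Like the assembly, it does not write a terminating 0.

```c
#include <assert.h>
#include <stdint.h>

/* value ~ eax, base ~ ecx, out ~ edi.
 * Returns the pointer just past the last character written. */
char *itoa_sketch(uint32_t value, uint32_t base, char *out)
{
    assert(base >= 2 && base <= 16);   /* range this sketch is meant for */

    uint32_t digit = value % base;     /* edx after the div */
    value /= base;                     /* eax after the div */

    if (value != 0)                    /* higher digits are emitted first */
        out = itoa_sketch(value, base, out);

    char c = (char)('0' + digit);
    if (c > '9')
        c = (char)(c + 0x27);          /* ':'..'?' becomes 'a'..'f' */

    *out++ = c;
    return out;
}
```

Calling itoa_sketch(255, 16, buf) writes the two characters ff into buf; the caller adds the terminator if it wants a C string.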

memset & strlen

the versions of strlen & memset are the most optimized of all...
I took some ideas from the Assembly Journal for the size-optimized version of strlen & the speed-optimized version of memset...
    the size-optimized version of strlen is only 10 bytes
    the speed-optimized version of strlen processes 1 character in 0.75 cycle, and it automatically handles misaligned memory references without penalty... I think it is near optimal speed...

    the speed-optimized version of memset is really fast too, & it avoids misaligned memory accesses too, it write ...I think it is near optimal speed too..
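
    The usual way to avoid misaligned accesses in a fill routine is to write single bytes until the pointer is dword aligned, then whole dwords, then the leftover bytes. The C sketch below shows that general idea only; I have not checked it against the clib source, and the name memset_sketch is mine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

void *memset_sketch(void *dst, int value, size_t n)
{
    assert(dst != NULL || n == 0);

    unsigned char *p = dst;
    unsigned char byte = (unsigned char)value;

    /* head: single bytes until p sits on a 4-byte boundary */
    while (n > 0 && ((uintptr_t)p & 3u) != 0) {
        *p++ = byte;
        n--;
    }

    /* body: one aligned dword per iteration, the byte repeated 4 times */
    uint32_t pattern = 0x01010101u * byte;
    while (n >= 4) {
        memcpy(p, &pattern, 4);   /* fixed-size copy, compilers emit one store */
        p += 4;
        n -= 4;
    }

    /* tail: the last 0..3 bytes */
    while (n > 0) {
        *p++ = byte;
        n--;
    }
    return dst;
}
```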

fprintf - first version

I developed a printf version for my OS; I have used it extensively, and I think there is no bug
for the moment fprintf can process:

    %s  for string
    %d  for decimal string
    %x for hexadecimal string
    %o for octal string
    %b for binary string
    \n   line feed (in assembler you can use  db 10 , it's the same)

for the moment these are the only things processed, more will come...
I only use the C-style call for fprintf, because there can be many args, and passing them in registers was too hard..
the strings are null-terminated like in C
fprintf(filedesc, string, args, ...)
fprintf can write to STDOUT or to a file
so I implement printf as a macro that does fprintf(STDOUT, string, args, ...)
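
To show the shape of such a routine, here is a small C sketch that handles the same five conversions and defines the printf-style macro on top of it. It is not the clib code: mini_fprintf, mini_printf, emit_number and STDOUT_FD are names I made up, treating %d as signed is my assumption, and the fixed 256-byte buffer (longer output is cut off) is just a simplification.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define STDOUT_FD 1   /* assumed value of STDOUT: file descriptor 1 */

/* Recursive digit emitter, same idea as the itoa routine. */
static char *emit_number(uint32_t value, uint32_t base, char *out)
{
    assert(base >= 2 && base <= 16);
    uint32_t digit = value % base;
    if (value / base != 0)
        out = emit_number(value / base, base, out);
    *out++ = (char)(digit < 10 ? '0' + digit : 'a' + (digit - 10));
    return out;
}

int mini_fprintf(int fd, const char *fmt, ...)
{
    char buf[256];
    char *out = buf;
    char *limit = buf + sizeof buf - 32;   /* room for 32 binary digits */
    va_list ap;

    va_start(ap, fmt);
    for (; *fmt != '\0' && out < limit; fmt++) {
        if (*fmt != '%') {                 /* ordinary character */
            *out++ = *fmt;
            continue;
        }
        fmt++;                             /* look at the conversion letter */
        switch (*fmt) {
        case 's': {
            const char *s = va_arg(ap, const char *);
            size_t len = strlen(s);
            size_t room = (size_t)(limit - out);
            if (len > room)
                len = room;                /* truncate instead of overflowing */
            memcpy(out, s, len);
            out += len;
            break;
        }
        case 'd': {
            int v = va_arg(ap, int);
            uint32_t u = (uint32_t)v;
            if (v < 0) {
                *out++ = '-';
                u = 0u - u;                /* magnitude, safe for INT_MIN */
            }
            out = emit_number(u, 10, out);
            break;
        }
        case 'x': out = emit_number(va_arg(ap, unsigned int), 16, out); break;
        case 'o': out = emit_number(va_arg(ap, unsigned int), 8, out);  break;
        case 'b': out = emit_number(va_arg(ap, unsigned int), 2, out);  break;
        case '\0': fmt--; break;           /* lone '%' at the end of the string */
        default:  *out++ = *fmt; break;    /* unknown letter: copy it as is */
        }
    }
    va_end(ap);
    return (int)write(fd, buf, (size_t)(out - buf));
}

/* printf is then just fprintf with the descriptor fixed to STDOUT. */
#define mini_printf(...) mini_fprintf(STDOUT_FD, __VA_ARGS__)
```

For example, mini_printf("%s=%d %x %o %b\n", "v", -42, 255, 8, 5) writes the line v=-42 ff 10 101 to standard output.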

    I think that this fprintf version is compact & small; tell me what you think of it, & whether it is interesting to implement all the printf stuff, or to do a compact version that just handles the most useful things...

    For the standard conversions it handles (%s, %d, %x, %o), my fprintf version is compatible with the standard C one...

memcpy - the worst

I haven't found any good optimization for memcpy, because the source & dest must both be dword aligned for it to be optimized, and if they are misaligned by different amounts you can't bring the two into alignment at the same time
so I implement memcpy with rep movsb
and I provide a memcpyl with rep movsd, but the source & dest must be dword aligned or no speed gain will be felt...
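
The alignment problem can be put in a few lines of C: source and dest can only be brought onto a dword boundary together when their addresses have the same two low bits. The sketch below is my illustration, not the clib code; same_dword_phase and copy_sketch are made-up names, and the dword loop stands in for rep movsd while the byte loops stand in for rep movsb.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* True when a and b are misaligned by the same amount (mod 4),
 * so a few leading byte copies align both at once. */
int same_dword_phase(const void *a, const void *b)
{
    return (((uintptr_t)a ^ (uintptr_t)b) & 3u) == 0;
}

void *copy_sketch(void *dst, const void *src, size_t n)
{
    assert((dst != NULL && src != NULL) || n == 0);

    unsigned char *d = dst;
    const unsigned char *s = src;

    if (same_dword_phase(d, s)) {
        /* head: bytes until both pointers are dword aligned */
        while (n > 0 && ((uintptr_t)d & 3u) != 0) {
            *d++ = *s++;
            n--;
        }
        /* body: one dword at a time (the memcpyl case) */
        while (n >= 4) {
            uint32_t w;
            memcpy(&w, s, 4);   /* fixed-size load */
            memcpy(d, &w, 4);   /* fixed-size store */
            d += 4;
            s += 4;
            n -= 4;
        }
    }
    /* tail, or the whole copy when the phases differ (the memcpy case) */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```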


OK



so...
        here are the current sizes of clib for the different compilation options:

            SPEEDOPT & C_CALL        ->   790 bytes
            SIZEOPT & C_CALL            ->    618 bytes

            SPEEDOPT & ASM_CALL   ->  676 bytes
            SIZEOPT & ASM_CALL        -> 504 bytes

so SIZEOPT saves 172 bytes for the moment, and ASM_CALL saves another 114 bytes over C_CALL

pretty small, no???