Toplevel files:
sefht.geany:
For development I use the IDE Geany. This file contains the
project description as well as the list of open files. It is tracked
by the VCS, because it is practical to resume at the position that
was last worked on when switching branches. However, this
file also often causes merge conflicts, which could be avoided
if it were not tracked by the VCS.
.gitignore:
Note that this project chooses not to include
generated files in the version control system, but of course
they must always be included in distributions.
configure.ac:
This package uses the GNU Autotools.
Until now, the configure script just checked that check is installed,
which is needed to compile the tests.
Now, configure provides a conditional (MISSING_CHECK) depending on its
presence for use by automake. If check is missing, the tests aren't
compiled. Instead a special script is executed, which informs the user of
the problem and stops the testsuite. Note that it was not possible to
directly stop the generation of the testsuite by injecting a rule into a
Makefile without relying on implementation details of automake.
See:
https://stackoverflow.com/questions/76376806/automake-how-to-portably-throw-an-error-and-aborting-the-target/76382437
To allow the script to issue messages to stderr, AM_TESTS_FD_REDIRECT is
used, because the parallel test harness redirects the output of its tests
to logfiles. This isn't needed for the serial test harness, because there
is no redirection to logfiles, but AM_TESTS_FD_REDIRECT is also not
taken into account there.
See:
https://www.gnu.org/software/automake/manual/html_node/Testsuite-Environment-Overrides.html
Additionally, configure provides an argument to enforce either
behaviour. When specifying --enable-tests=no, the tests are not compiled
regardless of the presence of check. If --enable-tests=yes is given, it is
assumed that the tests are really needed and the mandatory check for check
is performed, thus providing the former behaviour. If not specified,
--enable-tests defaults to auto, which results in the same behaviour as
--enable-tests=yes if check is present, and as --enable-tests=no
otherwise.
.gitlab-ci.yml:
This package uses a GitLab repository for version control and
also has some CI jobs. First the package is set up, as the files
generated by autoreconf are not included in the VCS. Then the
package is compiled and tested. Furthermore, a release is
created and uploaded. It is accompanied by a tag naming this
nightly release. Note that adding a tag triggers the pipeline
again, which would result in an error, as a release with the
same name can't be added twice. This is prevented with an
execution rule.
These releases are manually deleted from time to time, as they
take up space.
For uploading and creating the release, the tar-name is needed.
That's why for this there are separate shell scripts in which
configure substitutes some variables.
Note that separating the work into different stages and using a
makefile to determine what should be compiled doesn't always
work as intended with the behaviour of git and GitLab. The
library used to be always recompiled, even if it had already
been compiled in the previous stage, because on git checkout,
which is done at every stage, the files get the timestamp of
the checkout time, but the already built files, coming from
artifacts, have older timestamps, thus resulting in a recompilation.
This is fixed by setting the timestamp of every file to the
last change git knows of.
Actually some bugs were already found due to testing the
package in another environment.
main.c:
As this project is about a library, a main.c would not be
expected. It contains a small demo program using the library.
todo.txt:
Contains features that were found to be needed
but aren't implemented yet.
General:
Error handling is done by the status structure. The name status
was preferred over error, because the status is also set
when no error has occurred.
The structure must be allocated by the caller,
because allocation errors may need to be handled, and in such a
case, it is unlikely, that it is possible to allocate memory.
Every function that can fail predictably at runtime supports
passing a pointer to a status structure as the last parameter.
Functions that can't fail detectably don't have the parameter.
The structure contains an error type, the errno, the filename,
the function name, the line number and a message.
There are the following error types:
undefined: only needed to test, whether a function properly sets
the status parameter, might be removed in the future.
SUCCESS: no error has occurred.
E_ALLOC: allocation failure: malloc/realloc/calloc or strdup etc.
E_DOMAIN: Something is not representable due to a chosen type.
For example, there are more elements in an array than
the index type supports.
E_VALUE: Some parameter had an erroneous value. For example,
an index out of bounds or a non-existent reference.
E_STATE: Something is unfulfillable due to some constraint.
E_BUG: Some inconsistent state was encountered. This always
indicates a bug in the library, not in the user
program; for the latter, E_VALUE or E_STATE are used. However,
it might be caused by the user program manipulating
internals.
The filename, function name and line number point to the location
where the error originally occurred. This might not
be the function that was called from the outside. Filename and
function name are null-terminated strings allocated at compile time.
The message might be allocated at compile time or at runtime.
The proper way of accessing the status structure is yet to be
defined. Currently the structure is accessed directly, but it
has to be considered an implementation detail. There are also
macros to check for a set status.
When an error is detected, an ERROR is also passed to the log.
Because logging isn't implemented yet, it is replaced by a call
to printf.
Unfortunately the compiler reports that inside the macro
set_status, printf may be called with NULL [printf (NULL)],
although this is explicitly ruled out.
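A minimal sketch of how such a caller-allocated status structure and a
setter macro could look. All demo_/DEMO_ names are hypothetical
stand-ins and not the library's identifiers; the field layout and the
rule that a NULL status is skipped are assumptions made for
illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* All demo_/DEMO_ names are hypothetical stand-ins, not the library's API. */
enum demo_status_type {
    DEMO_UNDEFINED, DEMO_SUCCESS, DEMO_E_ALLOC, DEMO_E_DOMAIN,
    DEMO_E_VALUE, DEMO_E_STATE, DEMO_E_BUG
};

struct demo_status {
    enum demo_status_type type;
    int saved_errno;       /* errno at the time the status was set */
    const char *file;      /* where the status was set */
    const char *function;
    int line;
    const char *message;   /* may be NULL */
};

/* Fills a caller-allocated status. Assumption: a NULL status is skipped. */
#define DEMO_SET_STATUS(status, status_type, status_message) \
    do {                                                     \
        if ((status) != NULL) {                              \
            (status)->type = (status_type);                  \
            (status)->saved_errno = errno;                   \
            (status)->file = __FILE__;                       \
            (status)->function = __func__;                   \
            (status)->line = __LINE__;                       \
            (status)->message = (status_message);            \
        }                                                    \
    } while (0)

/* Returns array[index], or -1 with E_VALUE if index is out of bounds. */
int demo_get(const int *array, size_t n, size_t index,
             struct demo_status *status)
{
    assert(array != NULL);  /* programming error, not a runtime failure */
    if (index >= n) {
        DEMO_SET_STATUS(status, DEMO_E_VALUE, "index out of bounds");
        return -1;
    }
    DEMO_SET_STATUS(status, DEMO_SUCCESS, NULL);
    return array[index];
}
```

Because the status lives in the caller's memory (for example on the
stack), setting it never needs an allocation, which matters when the
error being reported is itself an allocation failure.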
Some may argue, that in case of a fatal error, like if no memory
can be allocated, a program should just give up and crash.
However this behaviour is not very user-friendly.
There might be some cases where it does not make sense or
isn't possible to give the user an opportunity to
e.g. save a file. But a library cannot decide
whether there is a way to inform the user, whether something must
be cleaned up, or even whether recovery is possible at all.
A lot of these recognized errors are a failing malloc or an
over-/underflow.
The caller can ignore error handling by passing NULL as
the Error parameter. Whether an error has occurred can also
always be determined by examining the return
value. [citation needed]
If the error occurs in a function returning a pointer, NULL will
be returned. If it returns a value, a special error value of
that type is returned, e.g. PAGE_ERR in SH_Data_register_page.
If the return type would be void otherwise, a boolean is
returned, which tells, whether the method has succeeded.
(FALSE means, that an error has occurred.)
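The three return conventions can be sketched as follows. The functions
and DEMO_INDEX_ERR are hypothetical examples invented for this sketch
(DEMO_INDEX_ERR merely plays the role that PAGE_ERR plays in the
library); none of them exist in the library.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reserved value, analogous in spirit to PAGE_ERR. */
#define DEMO_INDEX_ERR ((size_t)-1)

/* Pointer result: NULL signals an error. */
char *demo_copy(const char *string)
{
    if (string == NULL) return NULL;
    size_t size = strlen(string) + 1;
    char *copy = malloc(size);
    if (copy == NULL) return NULL;
    memcpy(copy, string, size);
    return copy;
}

/* Value result: a reserved value of the return type signals an error. */
size_t demo_index_of(const char *string, char c)
{
    if (string == NULL) return DEMO_INDEX_ERR;
    const char *hit = strchr(string, c);
    return (hit == NULL) ? DEMO_INDEX_ERR : (size_t)(hit - string);
}

/* Otherwise-void result: a boolean tells whether the call succeeded. */
bool demo_set_char(char *buffer, size_t length, size_t index, char c)
{
    if (buffer == NULL || index >= length) return false;
    buffer[index] = c;
    return true;
}
```

A caller that passes NULL for the status can thus still detect a
failure by comparing the result against NULL, the reserved value or
false.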
The error may have occurred in an internal method and is passed
upwards (the stack).
Internally, errors are handled by an enum, but this must be
considered an implementation detail and can be changed in later
versions.
It is the responsibility of the caller to recover gracefully.
It has to be assumed that the requested operation has neither
worked nor actually taken place. [citation needed]
Thus the operation can be retried (hopefully).
raw methods:
The library provides a way to directly access the tag in a read-only
way, which saves a call to strdup. This is useful if only reading is
necessary, but needs special care by developers, as it is neither
allowed to modify it nor to free it. Disregarding this will lead to a
segfault in the best case, and to silent data corruption and security
bugs in the worst case.
When methods in the API/ABI take pointers to strings in order to store
them in the library, there are two ways to do so. Either they copy the
string and leave the original intact, or they directly assign the given
pointer to some internal storage. The former is safer in terms of
memory, as the user doesn't have to remember that the string can't be
used anymore. The latter can be more efficient, as there is no extra
strdup call, but the user is then not allowed to change or free the
string and also can't use the pointer anymore, because it can't be known
whether the library has already freed it. As this should be decidable by
the user, the library often implements both approaches, where the method
that directly stores the pointer without creating a copy carries the
raw_ prefix.
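A minimal sketch of the two storing approaches and of read-only raw
access. struct demo_tag and the demo_ functions are hypothetical; only
the raw_ naming idea is taken from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical object that owns a heap-allocated name. */
struct demo_tag { char *name; };

/* Copying variant: the caller keeps ownership of `name`. */
bool demo_tag_set_name(struct demo_tag *tag, const char *name)
{
    assert(tag != NULL && name != NULL);
    size_t size = strlen(name) + 1;
    char *copy = malloc(size);
    if (copy == NULL) return false;
    memcpy(copy, name, size);
    free(tag->name);
    tag->name = copy;
    return true;
}

/* raw_ variant: takes ownership of a heap string without copying.
 * The caller must neither modify, free nor use `name` afterwards. */
void demo_tag_raw_set_name(struct demo_tag *tag, char *name)
{
    assert(tag != NULL && name != NULL);
    free(tag->name);
    tag->name = name;
}

/* Read-only raw access: the result points into the object and must
 * neither be modified nor freed by the caller. */
const char *demo_tag_raw_get_name(const struct demo_tag *tag)
{
    assert(tag != NULL);
    return tag->name;
}
```

The const qualifier on the raw getter lets the compiler catch at least
accidental writes; it cannot prevent a caller from freeing the string.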
goto:
Sometimes the common code to clean up in case of an error is
bundled at the end of a function. For people complaining about
the use of goto: this is exactly the use case where it is
recommended!
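A short sketch of this cleanup pattern. struct demo_pair and its
functions are hypothetical and only serve to show the control flow:
each label undoes exactly the steps that succeeded before the failure.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical object owning two heap-allocated strings. */
struct demo_pair { char *first; char *second; };

/* Returns a new pair holding copies of a and b, or NULL on failure. */
struct demo_pair *demo_pair_new(const char *a, const char *b)
{
    struct demo_pair *pair = malloc(sizeof *pair);
    if (pair == NULL) goto fail;

    pair->first = malloc(strlen(a) + 1);
    if (pair->first == NULL) goto fail_pair;
    strcpy(pair->first, a);

    pair->second = malloc(strlen(b) + 1);
    if (pair->second == NULL) goto fail_first;
    strcpy(pair->second, b);

    return pair;

    /* Cleanup in reverse order of acquisition. */
fail_first:
    free(pair->first);
fail_pair:
    free(pair);
fail:
    return NULL;
}

void demo_pair_free(struct demo_pair *pair)
{
    if (pair == NULL) return;
    free(pair->first);
    free(pair->second);
    free(pair);
}
```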
splint:
The source has been adapted to splint, which still reports some
errors, but all of them were checked to be false positives.
Classes:
CMS:
This class bundles some features and might be the entry point
of the library in the future. At the moment it doesn't do much.
- Pages are held by Data; CMS passes the call(s) through.
Data:
This class will handle Data needed by the CMS, provides storage
for modules, manages the database connection and maybe also
contains some caches. At the moment it only provides access to
the Validator.
The two predicates SH_Data_check_tag and SH_Data_check_attr are
wrappers to the appropriate methods of the validator. These are
needed, as there shouldn't be direct calls to the internal
structure of SH_Data.
The modifying methods are not exposed, as the validator
shouldn't be changed while others depend on it, this has to be
implemented later.
Data also contains a wrapper for the self-closing tag predicate.
Attr:
The structure SH_Attr implements an HTML Attribute.
For every function there is also a static method/function,
which can perform the same work, but doesn't rely on really
having a single struct Attr. This is useful for example in an
array to manipulate a single element.
Fragment:
Fragment is the core of SeFHT. (As the name suggests)
A Fragment can be every part of a website. The website is
handled as a tree (like the DOM, but this library doesn't
implement the DOM, it only resembles it, as this is the way,
HTML works).
There are several different types of Fragments. It is
represented by an abstract base class Fragment, which contains
some methods and attributes, which are supported by every type
of fragment. But the main functionality of the base class is,
to support inheritance. For this, it contains the type of
fragment, represented by an enum, and a virtual method table,
represented by a structure of function pointers.
The data needed by a fragment is a pointer to the Data object,
which is needed for getting any kind of information a fragment
might need, and a pointer to the parent node, which is useful
both for traversing the tree and for checking for cycles when a
node is added, i.e. ensuring that each fragment has exactly one parent.
This is necessary to prevent data corruption and also to keep
clear who is responsible for freeing the fragment.
Neither traversing nor ensuring consistency would be possible
otherwise.
The methods each fragment has to implement are a copy method,
a free method (destructor) and a method to output the html.
Also every class has a method which checks whether a given fragment
is of that type.
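A minimal sketch of such a base class with a type enum and a virtual
method table, plus one derived type. All names are hypothetical and the
Data pointer and the copy method are left out for brevity; it only
illustrates the mechanism, not the library's actual layout.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

enum demo_fragment_type { DEMO_NODE, DEMO_TEXT };

struct demo_fragment;

/* Virtual method table: one function pointer per overridable method. */
struct demo_fragment_methods {
    char *(*to_html)(const struct demo_fragment *);
    void (*destroy)(struct demo_fragment *);
};

/* Abstract base: type tag, method table and parent pointer. */
struct demo_fragment {
    enum demo_fragment_type type;
    const struct demo_fragment_methods *methods;
    struct demo_fragment *parent;
};

/* Derived type: the base must be the first member, so that a pointer
 * to the base can be converted to a pointer to the derived struct. */
struct demo_text_fragment {
    struct demo_fragment base;
    char *text;
};

static char *demo_strdup(const char *s)
{
    size_t size = strlen(s) + 1;
    char *copy = malloc(size);
    if (copy != NULL) memcpy(copy, s, size);
    return copy;
}

static char *text_to_html(const struct demo_fragment *fragment)
{
    const struct demo_text_fragment *self =
        (const struct demo_text_fragment *)fragment;
    return demo_strdup(self->text);
}

static void text_destroy(struct demo_fragment *fragment)
{
    struct demo_text_fragment *self = (struct demo_text_fragment *)fragment;
    free(self->text);
    free(self);
}

static const struct demo_fragment_methods text_methods = {
    text_to_html, text_destroy
};

struct demo_fragment *demo_text_fragment_new(const char *text)
{
    struct demo_text_fragment *self = malloc(sizeof *self);
    if (self == NULL) return NULL;
    self->text = demo_strdup(text);
    if (self->text == NULL) { free(self); return NULL; }
    self->base.type = DEMO_TEXT;
    self->base.methods = &text_methods;
    self->base.parent = NULL;
    return &self->base;
}

/* Type predicate and dispatching wrappers working on any fragment. */
bool demo_fragment_is_text(const struct demo_fragment *f)
{
    return f->type == DEMO_TEXT;
}

char *demo_fragment_to_html(const struct demo_fragment *f)
{
    return f->methods->to_html(f);
}

void demo_fragment_free(struct demo_fragment *f)
{
    f->methods->destroy(f);
}
```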
There are currently two types of fragments: Node and Text.
There is currently no forward compatibility for more types, but
this will be added. Also, modules might be partially implemented
by a different type of fragment.
The NodeFragment represents an html tag (like a Node in the DOM)
containing all its attributes and all subsequent Nodes (a tree).
A Fragment can contain children. When building the html, the
children's html is generated where appropriate.
The methods
- SH_Fragment_get_child (by index)
- SH_Fragment_is_child (non recursive) and
- SH_Fragment_is_descendant (recursive)
were added.
A Fragment can be copied either recursively (also copying all
children) or non-recursively (ignoring the children, thus the copy
never has children).
Adding the same element twice in the tree (graph) isn't
possible, as this would lead to problems e.g. double free or
similar quirks.
NodeFragment now uses the validator to validate the tags. The
attributes aren't validated yet, as this is more complicated,
because the tag is needed for that.
The single method (formerly SH_NodeFragment_append_child) to add a child
at the end of the child list was replaced by a set of methods to
insert a child at the beginning (SH_NodeFragment_prepend_child), at the
end (SH_NodeFragment_append_child), at a specific position
(SH_NodeFragment_insert_child) and directly before
(SH_NodeFragment_insert_child_before) or after another child
(SH_NodeFragment_insert_child_after). All these methods are implemented
by a single internal one (insert_child), as there isn't really much
difference between inserting one way or the other.
To save overhead, this internal method doesn't check whether the
insertion request is actually doable, as not every insertion method
requires this check. The check is done by the respective public method.
However, if the check is not done correctly, the internal method will
attempt to write to unallocated memory, which will hopefully result in a
segfault.
The child list is implemented as an array. To reduce the overhead of
realloc calls, the array is allocated in chunks of children. How many
have to be allocated is calculated by another static
method and determined by the macro CHILD_CHUNK. This is set to 5, which
is just a guess. It should be somewhere around the average number of
children per html element, to reduce unused overhead.
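A sketch of how an unchecked internal insert, checked public wrappers
and chunked allocation could fit together. The names are hypothetical,
plain ints stand in for the child pointers, and size overflow is
ignored for brevity; it is not the library's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_CHILD_CHUNK 5

/* Hypothetical child list; ints stand in for child pointers. */
struct demo_children {
    int *items;
    size_t count;
};

/* Slots to allocate for `count` children: next multiple of the chunk. */
static size_t demo_alloc_size(size_t count)
{
    return (count + DEMO_CHILD_CHUNK - 1) / DEMO_CHILD_CHUNK
           * DEMO_CHILD_CHUNK;
}

/* Internal insert: relies on the caller for index <= count. The assert
 * only documents the precondition and vanishes with NDEBUG. */
static bool insert_child(struct demo_children *c, int child, size_t index)
{
    assert(index <= c->count);
    size_t needed = demo_alloc_size(c->count + 1);
    if (needed != demo_alloc_size(c->count)) {
        /* Only reallocate when a chunk boundary is crossed. */
        int *grown = realloc(c->items, needed * sizeof *grown);
        if (grown == NULL) return false;
        c->items = grown;
    }
    /* Shift the tail one slot to the right to make room. */
    memmove(&c->items[index + 1], &c->items[index],
            (c->count - index) * sizeof *c->items);
    c->items[index] = child;
    c->count++;
    return true;
}

/* Prepend and append always produce a valid index: no check needed. */
bool demo_prepend(struct demo_children *c, int child)
{
    return insert_child(c, child, 0);
}

bool demo_append(struct demo_children *c, int child)
{
    return insert_child(c, child, c->count);
}

/* Only the positional variant has to validate the index. */
bool demo_insert(struct demo_children *c, int child, size_t index)
{
    if (index > c->count) return false;  /* would be reported as E_VALUE */
    return insert_child(c, child, index);
}
```

With a chunk of 5, appending six children causes two realloc calls
instead of six.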
Also some predicates (SH_NodeFragment_is_parent,
SH_NodeFragment_is_ancestor) were added to check whether a relationship
exists between two nodes, i.e. whether they are linked through one or
multiple levels. These functions could replace the old ones
(SH_NodeFragment_is_child, SH_NodeFragment_is_descendant) semantically.
Furthermore they are more efficient, as the check can now be done
via the parent pointer. The internal insert method also uses these
methods to check whether the child node is actually a parent of the
parent node, which would result in errors later on.
The old test is now obsolete but was kept, as it is not bad to test
more.
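A minimal sketch of such predicates based on the parent pointer. struct
demo_node and the function names are hypothetical; the point is that
walking upwards costs one step per level, whereas a check from the
ancestor's side would have to search the whole subtree.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node reduced to its parent pointer. */
struct demo_node { struct demo_node *parent; };

/* True if `node` is the direct parent of `child`. */
bool demo_is_parent(const struct demo_node *node,
                    const struct demo_node *child)
{
    return child->parent == node;
}

/* True if `node` is reachable from `descendant` by following the
 * parent pointers over one or more levels. */
bool demo_is_ancestor(const struct demo_node *node,
                      const struct demo_node *descendant)
{
    const struct demo_node *p;
    for (p = descendant->parent; p != NULL; p = p->parent) {
        if (p == node) return true;
    }
    return false;
}

/* Inserting `child` below `parent` would close a cycle if `child` is
 * `parent` itself or one of its ancestors. */
bool demo_would_create_cycle(const struct demo_node *parent,
                             const struct demo_node *child)
{
    return child == parent || demo_is_ancestor(child, parent);
}
```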
Various remove methods were added, which are all implemented by a
static method, analogous to the insert methods.
The method SH_NodeFragment_get_attr provides a pointer to an Attr by
its index. Note that it points directly to the internal data instead
of copying the data to a new Attr, which would be unnecessary overhead
if only read access is needed. That's why it is also a const pointer.
If the user intends to modify it, a copy should be taken via
SH_Attr_copy.
Multiple insert methods allow either adding an existing Attr or
creating a new one implicitly. If the Attr is not already used beforehand,
it is more efficient to call the attr_new methods. Also, an existing Attr
is freed after it was inserted, thus it can't be used afterwards. This is
necessary, as for efficiency reasons an array of Attr is used directly,
instead of the indirect approach of storing pointers to Attr. This
means that the contents of the Attr have to be copied to the internal
structure. If the old Attr were left unfreed, there would be two
Attrs, the original one and the implicit one, referring to the same
data, which would lead to data corruption at least, or to undefined
behaviour like a double free, which would be a serious threat for a
library that is to be used on a webserver. ...
For each of the two insert modes, there is a method to prepend, append
or insert at a specific position. An incorrect position is handled
inside of the external method and E_VALUE is reported. The internal
method doesn't handle this, so special care must be taken not to invoke
undefined behaviour. However, enforcing this check would be unnecessary
overhead for the prepend and append methods, which are known to have
correct indices, as well as for other internal methods where the internal
method may be used.
Two alternatives are provided: remove_attr and pop_attr. While the
former frees the Attr's data, the latter allocates a new Attr to store
and return the data. Both are implemented by a single
(internal) static method.
A Fragment can output its html. If there is an error, the method
aborts and returns NULL.
This method also pays attention to self-closing tags, which is
determined via the validator.
When the wrap mode is used, after each tag a newline is started.
Also the html is indented, which can be configured by the
parameters indent_base, indent_step and indent_char. The
parameter indent_base specifies the width the first tag should
be indented with, while indent_step specifies the increment of
the indent when switching to a child tag. The character, that is
used for indenting is taken from indent_char. (It could also be
a string longer than a single character).
These arguments can't be set by the user yet, but are hardcoded
(for now).
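One way to read these three parameters is sketched below. The function
is hypothetical, and it assumes that the indent of a tag at nesting
depth d consists of indent_base + d * indent_step repetitions of
indent_char; the source does not spell out this formula. Size overflow
is ignored for brevity.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated indent string for a
 * tag at the given nesting depth, or NULL if the allocation fails.
 * indent_char may be longer than a single character. */
char *demo_make_indent(size_t indent_base, size_t indent_step,
                       size_t depth, const char *indent_char)
{
    size_t width = indent_base + indent_step * depth;
    size_t unit = strlen(indent_char);
    char *indent = malloc(width * unit + 1);
    if (indent == NULL) return NULL;
    for (size_t i = 0; i < width; i++) {
        memcpy(indent + i * unit, indent_char, unit);
    }
    indent[width * unit] = '\0';
    return indent;
}
```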
The to_html method also generates the html for the attributes.
Note that the quotes the values are wrapped with are not escaped
inside the values. But this is somewhat consistent, as there is
no syntax validation on the tags either
(e.g. no check for a '<' inside of a tag).
NodeFragment is virtually finished, but TextFragment is still
missing, as it depends on still not implemented functionality
of SH_Text.
The TextFragment is used to implement the text between and
outside html tags. Currently, it is still very rudimentary in
that it doesn't support any operations at all and just has a
function to expose the internal text.
While this function is necessary to manipulate the content of a
TextFragment, the TextFragment should abstract the semantics of
Text. While simple wrapper functions for appending are to be
added, methods purely manipulating the text, i.e. relying on
the text's contents, won't get wrapper functions. Thus this
function is still needed until a more sophisticated approach is
implemented.
Some basic text functionality is already supported via wrapper
functions.
Note that wrapper functions aren't tested in unit tests.
When a newline is encountered in the text, a <br /> is inserted
and for wrap mode also a newline and an indent is inserted.
Note, that the indent is still missing at the front where it
can't be inserted yet as SH_Text is still lacking basic
functionality.
The html generation for both TextFragment and NodeFragment
combined is tested. As the encoding semantics of the
TextFragments are neither defined nor implemented, some tests
are marked as XFAIL.
What is still missing is the proper treatment of embedded text.
It should be indented and wrapped at 72/79/80 columns. Also newlines
and special chars should be replaced on generation, maybe also
providing some way of preventing XSS. Regarding the NodeFragment,
there should be some options to further adjust the styling,
which of course should also be reflected by TextFragment. This
should also include the generation of self-closing tags.
Furthermore the html generation should be based on a single
text object, which is appended to. This will later on also
allow sending generated parts over the network directly while
still generating some data.
Validator:
Validator serves as a syntax checker, i.e. it can be asked
whether a tag is allowed.
On initialization (of data), the Validator's knowledge is filled
with some common tags. This is of course to be replaced later
by some dynamic handling.
When a tag that the Validator already knows is made known to it
again, the old id is returned and nothing is added.
The Validator saves the tags as an array. Now additional information
is stored about which slots are currently unused, to spare expensive
calls to realloc. This led to a near-complete reimplementation of the
functions. Tags can't be deleted yet, but the adding function supports
reusing empty slots. Also the reading functions have to determine
whether a slot can be read or is empty.
The tests were adjusted, but are buggy, so they should be rewritten in
the future.
A registered tag can be deregistered by calling SH_Validator_deregister.
The data is removed, but the space is not deallocated if it is not at
the end. This prevents copying data on removal and saves expensive calls
to realloc. Instead the empty space is added to the list of free blocks,
which allows refilling these spaces when a new tag is registered.
The space is finally deallocated when the validator is deallocated
or when the tag stored in the last block is removed. In the latter case,
heavy iteration is performed, as the list of free blocks is not ordered.
The new last tag is determined by looking up candidate slots in the list
of free blocks until one is found that is not in the list.
Note that even if there can be a lot of gaps in between, the Validator
will not allocate more space until all these gaps are refilled when a
new tag is registered, thus new space is only being allocated, if there
is really not enough space left.
Due to the 4 nested loops, there was an issue related to the
72(80)-column rule. It can't be followed without severely impacting the
readability of the code.
Originally the ids were intended to be useful for linking different
information together internally and for providing references
externally. However, they weren't used internally; for this, pointers
seemed to be more useful, as they also allow direct access to the data
and also have a relation defined.
Regarding reference purposes, they aren't really needed, and it is more
convenient to directly use strings. They also aren't more
performant, as there still have to be internal checks, and looking for an
int isn't faster than looking for a pointer.
Also, they have to be stored, so they need more memory and also some
code to handle them.
While it was very clever, the complex data structure of the tag array
introduced in 'Validator: restructured internal data (a0c9bb2)' comes
with a lot of runtime overhead. It reduces the calls to free and
realloc when a lot of tags are deleted and inserted subsequently, but
burdens each call with a loop over the linked list of free blocks.
This matters even more because the validator must be fast in checking,
which is done every time something is inserted into the DOM tree, while
the requirements for registering new tags are less tight, as this is
mostly done at startup time.
As the access must be fast, the tags are sorted when inserted, so that
the search can take place in logarithmic time.
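A minimal sketch of a logarithmic lookup in a sorted array of tag
names. The function name and the plain array of strings are
assumptions; the library's internal representation may differ. On a
miss, the sketch also reports the position at which the tag would have
to be inserted to keep the array sorted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Binary search in tags[0..n), which must be sorted by strcmp.
 * Returns true and the index in *position if the tag is found,
 * otherwise false and the insertion point in *position. */
bool demo_find_tag(const char *const *tags, size_t n, const char *tag,
                   size_t *position)
{
    assert(tag != NULL && position != NULL);
    size_t low = 0;
    size_t high = n;
    while (low < high) {
        size_t mid = low + (high - low) / 2;
        int cmp = strcmp(tag, tags[mid]);
        if (cmp == 0) {
            *position = mid;
            return true;
        }
        if (cmp < 0) high = mid;
        else low = mid + 1;
    }
    *position = low;
    return false;
}
```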
There is a method to add a set of tags to a validator on initialisation.
First, this relieves a user application of the burden of maintaining the
html spec, and second, it is more performant, as a lot of tags are
inserted at once, so there aren't multiple allocation calls.
As the validator needs the tags to be in order, the tags must be sorted
on insertion. Of course it would be easier for the code if the tags
were already in order, but first, a mistake could easily be made, and
second, sorting the tags by an algorithm allows the tags to be specified
in a logically grouped and thus more maintainable order.
For the sorting, insertion sort is used. Of course it has a worse,
quadratic time complexity, but in a constructor I wouldn't introduce
the overhead of memory management a heap- or mergesort would introduce,
and in-place sorting is also out, because the data lies in read-only
memory. Thus I chose an algorithm with constant space complexity. Also
the 'long' running time is not so important, as the initialization only
runs once at startup and the tags are not likely to exceed a few hundred,
so even a quadratic time isn't that bad.
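A sketch of the idea, assuming the destination array has already been
allocated in one go: each tag is read from the read-only source and
inserted at its sorted position in the destination, so no scratch
memory is needed. The function name and the use of plain string
pointers are assumptions for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Copies the n pointers of src into dest (capacity >= n) in strcmp
 * order. src is never written to, so it may lie in read-only memory.
 * Quadratic time, constant extra space. */
void demo_sorted_copy(const char *const *src, size_t n, const char **dest)
{
    for (size_t i = 0; i < n; i++) {
        size_t j = i;
        /* Shift larger entries one slot to the right. */
        while (j > 0 && strcmp(dest[j - 1], src[i]) > 0) {
            dest[j] = dest[j - 1];
            j--;
        }
        dest[j] = src[i];
    }
}
```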
Each tag has a type as defined by the html spec. This must be provided
on registration. Implicitly registering tags, when an attribute is
registered can't be done anymore, as the type information would be
missing.
The added parameter in register_tag, as well as the change of behaviour
in register_attr, has broken a lot of tests, which had to be adjusted
accordingly.
Added self-closing predicate. Other predicates may follow.
The Validator already contains all HTML5 tags.
Tags according to:
https://html.spec.whatwg.org/dev/indices.html#elements-3
Types according to:
https://html.spec.whatwg.org/multipage/syntax.html#elements-2
Retrieved 04. 10. 2023
An attribute can be deregistered by calling SH_Validator_deregister_attr.
Note that deregistering an attr that was never registered is considered
an error, but this may change, as technically it is not registered
afterwards and sometimes (e.g. for a blacklist) it might be preferable
to ensure that a specific attr is not registered. It is not clear yet
whether there should be an error or not.
Also the deallocation of the data used for an attr was moved to an extra
method, as this is needed in several locations and it might be subject
to change.
The Validator can check whether an attribute is allowed in a tag. It does
so by associating allowed tags with attributes. This is done in that way
to also support attributes which are allowed for every tag (global
attributes), but this is not yet supported. So some functions allow for
NULL to be passed and some will still crash.
The predicate SH_Validator_check_attr returns whether an attribute is
allowed for a specific tag. If tag is NULL, it returns whether an attr
is allowed at all, not whether it is allowed for every tag. For this
another predicate will be provided, when this is to be implemented.
The method SH_Validator_register_attr registers a tag-attr combination.
Note that it will automatically call SH_Validator_register_tag if the
tag doesn't exist. Later it will be possible to set tag to NULL to
register a global attribute, but for now the method will crash.
The method SH_Validator_deregister_attr removes a tag-attr combination
registered earlier. Note that deregistering a non-existent combination
will result in an error. This behaviour is debatable and might be subject
to change. When setting only tag to NULL, all tags for this attribute
are deregistered. When setting only attr to NULL, all attrs for this tag
are deregistered. This might cause problems if it involves
attrs that are global. Also this will use the internal method
remove_tag_for_all_attrs, which has the problem that it might fail
partially. Normally, when failing, all functions revert the program to
the same state it was in before the call. This function, however, is
different: if it fails, there might be some combinations that haven't
been removed, while others already have been. Nevertheless, the validator
is still in a valid state, so it is possible to call this function a
second time, but it is not certain which combinations are already
deregistered.
As the attrs also use the internal strings of the tags, it must be
ensured, when a tag is deregistered, that all remaining references are
removed, otherwise there would be dangling pointers. Note, that for this
also remove_tag_for_all_attrs is used, so the method
SH_Validator_deregister_tag suffers from the same problems listed above.
Also if this internal method fails, the tag won't be removed at all.
Similar to the tags, the attributes can be initialized. Missing tags are
automatically added. The declaration syntax is currently a bit annoying,
as the tags that belong to an attribute either have to be declared
explicitly or a pointer to the tag declaration must be given, but then
only consecutive tags are possible.
Support for global attributes is likewise missing; it must be ensured
that (tag_n != 0) && (tags != NULL). Otherwise the validator will be
inconsistent and there might be a bug.
Global attributes are represented by empty attributes. A global
attribute is an attribute, that is accepted for any tag.
It is refused to remove a specific tag for a global attribute, as this
would mean to "localize" the tag, thus making it not global anymore.
The method to do that and a predicate for globalness are still missing.
Deregistering a global attribute is normally not possible, as basically
every other tag has to be added. This has now been implemented.
Originally it was intended to provide the caller with the information
that a global attribute has to be converted into a local one before
removal. However, such internals should not be exposed to the caller. As
it stands, there is no real reason to inform a caller whether an
attribute is local or global. Also, there is the problem that the
predicate is burdened with the possibility that the attribute doesn't
exist, thus it can't return a boolean directly. For both reasons, the
predicate hasn't been added yet.
Also a bug was detected in the method remove_tag_for_all_attrs. It
removes an attribute while also iterating over it, thus potentially
skipping over some attribute and maybe also invoking undefined behaviour
by deallocating space after the array.
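The general pitfall and one common way around it can be sketched on a
plain array. This is not the library's fix, and demo_remove_all is a
hypothetical function: the index is only advanced when nothing was
removed, because after a removal the next element has moved into the
current slot.

```c
#include <stddef.h>
#include <string.h>

/* Removes every element equal to `value` from items[0..*count) and
 * returns the number of removed elements. */
size_t demo_remove_all(int *items, size_t *count, int value)
{
    size_t removed = 0;
    size_t i = 0;
    while (i < *count) {
        if (items[i] == value) {
            /* Close the gap; never touches memory past the array. */
            memmove(&items[i], &items[i + 1],
                    (*count - i - 1) * sizeof *items);
            (*count)--;
            removed++;
            /* Do not advance: a new element now sits at index i. */
        } else {
            i++;
        }
    }
    return removed;
}
```

A loop that increments i unconditionally would skip the second of two
adjacent matches, which is the kind of skipping described above.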
Copying a Validator could be useful if multiple html versions are to be
supported. Another use case is a blacklist XSS-Scanner.
Text:
This is a data type to deal with frequently appending to a string.
The space a Text has for saving the string is allocated in chunks.
To request additional space, SH_Text_enlarge is called. If the
requested size fits inside the already allocated space or is even
smaller than the current size, nothing is done. Otherwise the smallest
multiple of the chunk size that is equal to or greater than
the requested size is allocated. The chunk size can be changed by
changing the macro CHUNK_SIZE in src/text.h. The default is 64.
The adjustment is done automatically when a string is added.
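A sketch of the enlarge rule, assuming a Text is just a buffer with a
recorded size. struct demo_text and demo_text_enlarge are hypothetical
and do not mirror the signature of SH_Text_enlarge.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_CHUNK_SIZE 64

/* Hypothetical text buffer: `size` is the allocated space in bytes. */
struct demo_text {
    char *data;
    size_t size;
};

/* Makes sure at least `requested` bytes are allocated. Returns false
 * if rounding up would overflow or the allocation fails; the text is
 * left unchanged in that case. */
bool demo_text_enlarge(struct demo_text *text, size_t requested)
{
    if (requested <= text->size) return true;  /* already fits */
    if (requested > SIZE_MAX - (DEMO_CHUNK_SIZE - 1)) return false;

    /* Round up to the next multiple of the chunk size. */
    size_t new_size = (requested + DEMO_CHUNK_SIZE - 1)
                      / DEMO_CHUNK_SIZE * DEMO_CHUNK_SIZE;
    char *grown = realloc(text->data, new_size);
    if (grown == NULL) return false;
    text->data = grown;
    text->size = new_size;
    return true;
}
```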
SH_Text_append_string can be used to append a string to the text,
SH_Text_append_text can be used to append another text to the text.
SH_Text_join is a wrapper for SH_Text_append_text, but also frees
the second text, thus joining the texts to a single one.
The constructor SH_Text_new_from_string accepts a string, with that the
text is initialized. This can replace the so far needed two calls
SH_Text_new and SH_Text_append_string.
The (internal) implementation of SH_Text was changed from an array of
char to a singly linked list of arrays of char. This allows an easier
implementation of (further) text manipulation.
The API hasn't changed much, but SH_Text_join can't yield an error
anymore, so it no longer accepts an error parameter and returns nothing.
The method SH_Text_get_char returns the single character at a given
index. If the index is out of range, NULL is returned and error->type is
set to VALUE_ERROR.
The function SH_Text_get_string returns a substring of text beginning at
index and of length offset. If index is out of bounds, NULL is returned
and an error is set. If offset is out of bounds, the existing part is
returned. Also the length of the returned string can optionally be
stored in the out parameter length.
To achieve the original behaviour of SH_Text_get_string,
SH_Text_get_string (text, length, error) has to be changed to
SH_Text_get_string (text, 0, SIZE_MAX, length, error). The only
difference will be that the function won't fail when the text is longer
than SIZE_MAX, because it is told to stop there. A text that is longer
than SIZE_MAX cannot be returned, but that wasn't possible at any time.
Also I don't think handling a char[] longer than SIZE_MAX is possible
with the standard C library. Thus in this case the text can only be
returned in parts (for now only possible up to 2*SIZE_MAX-1 by calling
SH_Text_get_string (text, SIZE_MAX, SIZE_MAX, length, error)) or has to
be manipulated using the appropriate SH_Text methods, which are not
implemented yet.
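The index/offset semantics of SH_Text_get_string can be sketched on a flat
C string. The function substring below is a hypothetical stand-in, not the
real implementation, and it omits the error parameter. It also assumes
that an index equal to the length of the string is still in bounds and
yields the empty string; the description above leaves this open.

```c
#include <stddef.h>
#include <stdint.h> /* SIZE_MAX, for callers that want "up to the end" */
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in working on a flat string. Returns a newly
 * allocated copy of at most `offset` chars of `src`, starting at
 * `index`. Returns NULL if `index` is out of bounds or if the
 * allocation fails. If `length` is not NULL, it receives the length
 * of the returned string. */
static char *
substring (const char *src, size_t index, size_t offset, size_t *length)
{
    size_t src_length = strlen (src);
    size_t available;
    char *result;

    if (index > src_length)
    {
        return NULL;
    }

    /* If offset reaches beyond the end, only the existing part is
     * returned. */
    available = src_length - index;
    if (offset > available)
    {
        offset = available;
    }

    result = malloc (offset + 1);
    if (result == NULL)
    {
        return NULL;
    }

    memcpy (result, src + index, offset);
    result[offset] = '\0';

    if (length != NULL)
    {
        *length = offset;
    }

    return result;
}
```

In this sketch substring (src, 0, SIZE_MAX, &length) copies the whole
string, which mirrors the replacement call given above.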
The function SH_Text_get_range returns a string beginning at start and
ending at end. Note that end specifies the first char that is not
returned any more. Thus the function implements something similar to
Python's slice syntax (text[start:end]). In contrast to the behaviour
there, calling SH_Text_get_range with start > end is undefined
behaviour. If start == end, the empty string is returned.
If start is out of bounds, NULL is returned and an error is set. If end
is out of bounds, the existing part is returned. Also the length of the
returned string can optionally be stored in the out parameter length.
The function SH_Text_get_length returns the length of the text. As the
text also supports being longer than SIZE_MAX, this method can fail at
runtime. If the text is longer than SIZE_MAX, SH_Text_get_length returns
SIZE_MAX and sets error to DOMAIN_ERROR. Note that due to the
implementation this is a non-trivial function, so don't use it too
extensively.
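Why the length can overflow and why it isn't trivial to compute can be
seen in the following sketch, which sums up the lengths of the segments of
a singly linked list. struct segment and total_length are hypothetical;
the project's actual data structures and error handling may differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical segment type; only the fields needed here are shown. */
struct segment
{
    size_t length;          /* number of chars stored in this segment */
    struct segment *next;   /* following segment or NULL */
};

/* Walks the whole list and sums up the segment lengths. If the sum
 * doesn't fit into a size_t, SIZE_MAX is returned and *overflow is
 * set to true. */
static size_t
total_length (const struct segment *first, bool *overflow)
{
    const struct segment *seg;
    size_t length = 0;

    *overflow = false;

    for (seg = first; seg != NULL; seg = seg->next)
    {
        /* Check before adding, as unsigned arithmetic would wrap. */
        if (seg->length > SIZE_MAX - length)
        {
            *overflow = true;
            return SIZE_MAX;
        }

        length += seg->length;
    }

    return length;
}
```

As every segment has to be visited, the cost grows with the number of
segments, which is why the result is better kept by the caller than
requested repeatedly.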
The method SH_Text_print just prints the whole string to stdout.
The function SH_Text_set_char allows writing a single character to a
position that already exists in the text, thus overwriting another
character. If the index is out of range, a value error is set and FALSE
is returned.
An attempt was made to implement the text in terms of multiple text
segments.
While it would be preferable, it doesn't seem to be possible to
abstract over the internals of text_segment. That's why only
some basic functionality has been moved; whether more is to
follow is not known yet.
A text_segment allocates memory in terms of chunks. This is now
also done when it is created from a string, but this means that we
can't rely on strdup any more, as it takes care of the
allocation itself. Calling malloc ourselves shouldn't add much
overhead, as at least glibc's strdup performs the exact same
steps. Actually we should save a strlen call now, so it
should be more performant.
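A chunk-aware replacement for strdup might look like the following sketch.
chunked_dup is a hypothetical name and the CHUNK_SIZE of 64 is only taken
over from the default mentioned for the text; the real text_segment code
may be organised differently. The length determined for the allocation is
handed back, so the caller doesn't have to call strlen again.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE 64

/* Hypothetical strdup replacement that allocates a multiple of
 * CHUNK_SIZE. On success the copy is returned, *length receives the
 * string length and *size the number of allocated bytes. On failure
 * NULL is returned and the out parameters are left untouched. */
static char *
chunked_dup (const char *src, size_t *length, size_t *size)
{
    size_t len = strlen (src);
    size_t needed = len + 1;    /* including the terminating '\0' */
    size_t chunks = needed / CHUNK_SIZE;
    char *copy;

    if (needed % CHUNK_SIZE != 0)
    {
        chunks++;
    }

    if (chunks > SIZE_MAX / CHUNK_SIZE)
    {
        return NULL;
    }

    copy = malloc (chunks * CHUNK_SIZE);
    if (copy == NULL)
    {
        return NULL;
    }

    /* Copies the string together with its terminator. */
    memcpy (copy, src, needed);

    *length = len;
    *size = chunks * CHUNK_SIZE;
    return copy;
}
```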
The copy_and_replace function replaces a single character with
a string while copying. This may be replaced by a more elaborate
function, as manipulating a text normally means that the
manipulation is deferred until needed, which this function
contradicts.
Also there is the concept of a text mark.
A mark will be used to point to a specific location inside of a
text. Currently it can't do anything and isn't even used.
Tests:
Tests are done using check, which allows integrating the tests
into the GNU Autotools.
Methods that are part of another unit, but are called in a unit,
aren't tested, as this would interfere with the idea of unit tests.
This applies to pure wrapper functions, where a call is just
passed to another unit.
Because sometimes an overflow condition is checked, it is
necessary to include the source file into the test, instead of
linking against it.
Sometimes it isn't possible to check for correct overflow
detection by setting some number to ..._MAX, because this
number is actually used, thus a SIGSEGV would be raised. This is
solved by filling in garbage until ..._MAX is really reached. Because
there is a timeout for the tests and it would fill RAM with
gigabytes of garbage, ..._MAX is overridden prior to inclusion
of the source file.
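The idea of overriding a limit before the source file is included can be
sketched in a single file as follows. COUNTER_MAX, counter_increment and
the file name counter.c are hypothetical and only used for illustration.
The sketch further assumes that the unit defines its limit only if it
isn't defined yet, which may differ from how the project overrides the
macro.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* In the test file: choose a small limit before including the unit,
 * so the overflow condition is reached after a few steps. */
#define COUNTER_MAX 3

/* The rest stands in for `#include "counter.c"` (hypothetical unit). */
#ifndef COUNTER_MAX
#define COUNTER_MAX SIZE_MAX
#endif

/* Increments *counter. Returns false and leaves *counter untouched,
 * if the counter has already reached COUNTER_MAX. */
static bool
counter_increment (size_t *counter)
{
    if (*counter == COUNTER_MAX)
    {
        return false;
    }

    (*counter)++;
    return true;
}
```

A test can now drive the counter to its limit with three calls and check
that the fourth one reports the overflow, without filling memory or
running into the timeout.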
TODO:
Log:
It is useful for debugging to actually see the error messages.